osd: have clients resend ops on pg split #13235

Merged
liewegas merged 35 commits into ceph:master from liewegas:wip-pg-split-interval on Feb 15, 2017

@liewegas
Member

liewegas commented Feb 2, 2017

  • adds require_luminous_osd OSDMap flag
  • adds luminous server feature and a resend_on_split feature (same value)
  • adds a placeholder M feature
  • adds generic bits forcing upgrades to stop at L, enforcement for require_luminous_osd flag
  • fixes last_split_epoch tracking on OSD (well enough for our purposes)
  • renames last_force_op_resend to last_force_op_resend_preluminous, and adds a new field for luminous and later clients
  • makes objecter resend on split to osds with the resend-on-split feature
  • makes osd discard all ops prior to split
  • fixes backoff to operate on a per-spg_t basis
  • renames backoffs to make more sense
  • fixes fragile objecter code around tiering and pg_num
@jdurgin

should add an upgrade test that sets require_luminous_osds too

@@ -1965,6 +1965,15 @@ bool OSDMonitor::preprocess_boot(MonOpRequestRef op)
     }
   }
   // make sure upgrades stop at luminous
   if ((m->osd_features & CEPH_FEATURE_SERVER_M) &&
       !osdmap.test_flag(CEPH_OSDMAP_REQUIRE_LUMINOUS)) {

@jdurgin

jdurgin Feb 3, 2017

Member

looking at the places we use the jewel and kraken flags/features, there are a couple more that need luminous-flag/feature handling:

  • create_initial()
  • encode_pending()
  • preprocess_boot() - second check needed to prevent pre-luminous OSDs booting when the flag is set
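
The two boot-time checks discussed in this thread (a post-luminous OSD must not join before the flag is set, and a pre-luminous OSD must not join after it is set) can be sketched roughly as follows. This is an illustration only: the bit values, `BootDecision`, and `check_boot()` are hypothetical and are not the monitor's actual code.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder bit values for illustration only; the real constants live in
// Ceph's feature and OSDMap headers and have different values.
constexpr uint64_t FEATURE_SERVER_LUMINOUS = 1ull << 0;
constexpr uint64_t FEATURE_SERVER_M        = 1ull << 1;
constexpr uint32_t FLAG_REQUIRE_LUMINOUS   = 1u << 0;

enum class BootDecision { Allow, RejectTooNew, RejectTooOld };

// Hypothetical helper: decide whether an OSD advertising `osd_features`
// may boot into a cluster whose OSDMap carries `map_flags`.
BootDecision check_boot(uint64_t osd_features, uint32_t map_flags) {
  const bool require_luminous = (map_flags & FLAG_REQUIRE_LUMINOUS) != 0;
  // Upgrades must stop at luminous: a post-luminous OSD may not join until
  // the cluster has committed to luminous.
  if ((osd_features & FEATURE_SERVER_M) && !require_luminous)
    return BootDecision::RejectTooNew;
  // Once the flag is set, pre-luminous OSDs may no longer boot.
  if (require_luminous && !(osd_features & FEATURE_SERVER_LUMINOUS))
    return BootDecision::RejectTooOld;
  return BootDecision::Allow;
}
```

Keeping both checks in one function makes it harder to add the first without the second, which is the gap pointed out above.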
@@ -6480,6 +6480,20 @@ bool OSDMonitor::prepare_command_impl(MonOpRequestRef op,
         ss << "not all up OSDs have CEPH_FEATURE_SERVER_KRAKEN feature";
         err = -EPERM;
       }
     } else if (key == "require_luminous_osds") {

@jdurgin

jdurgin Feb 3, 2017

Member

need to add to MonCommands.h too

  if (!(features & CEPH_FEATURE_NEW_OSDOP_ENCODING)) {
    // this was the first post-hammer thing we added; if it's missing, encode
    // like hammer.
    v = 21;
  }
  if ((features &
       (CEPH_FEATURE_RESEND_ON_SPLIT|CEPH_FEATURE_SERVER_KRAKEN)) !=
      (CEPH_FEATURE_RESEND_ON_SPLIT|CEPH_FEATURE_SERVER_KRAKEN)) {

@jdurgin

jdurgin Feb 3, 2017

Member

this is a bit redundant since FEATURE_RESEND_ON_SPLIT aka FEATURE_SERVER_LUMINOUS implies FEATURE_SERVER_KRAKEN

@liewegas

liewegas Feb 3, 2017

Member

oops, should have been JEWEL. this is needed because we are recycling feature bits and older code will have these for other reasons. we really need to introduce a set of helpers to manage this mess...
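
A minimal sketch of the kind of helper mentioned here, assuming a recycled bit is only meaningful when a newer "anchor" bit is also present. The bit values and the names `has_all()` and `peer_resends_on_split()` are hypothetical, not existing Ceph helpers.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder values for illustration; not Ceph's real bit assignments.
constexpr uint64_t FEATURE_SERVER_JEWEL    = 1ull << 3;
constexpr uint64_t FEATURE_RESEND_ON_SPLIT = 1ull << 7;  // a recycled bit

// True only if every bit in `mask` is present in `features`.
constexpr bool has_all(uint64_t features, uint64_t mask) {
  return (features & mask) == mask;
}

// Hypothetical helper: the recycled bit only means "resend on split" when the
// peer is also new enough to carry the jewel bit; an older peer may set the
// same bit for an unrelated, retired feature.
constexpr bool peer_resends_on_split(uint64_t features) {
  return has_all(features, FEATURE_RESEND_ON_SPLIT | FEATURE_SERVER_JEWEL);
}
```

Comparing against the full mask, rather than testing for a non-zero result, is what makes the combined check stricter than checking either bit alone.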

@@ -2735,7 +2748,7 @@ int Objecter::_calc_target(op_target_t *t, bool any_change)
pg_num,
t->sort_bitwise,
sort_bitwise,
pg_t(prev_seed, pgid.pool(), pgid.preferred()))) {

@jdurgin

jdurgin Feb 3, 2017

Member

why remove pgid.preferred()?

@liewegas

liewegas Feb 3, 2017

Member

support was removed like 8 years ago; should really scrub it from the code.

@@ -2811,13 +2824,19 @@ int Objecter::_calc_target(op_target_t *t, bool any_change)
   if (need_resend) {
     return RECALC_OP_TARGET_NEED_RESEND;
   }
   if (con &&
       con->has_features(CEPH_FEATURE_RESEND_ON_SPLIT |

@jdurgin

jdurgin Feb 3, 2017

Member

doesn't RESEND_ON_SPLIT imply SERVER_JEWEL?

@liewegas

liewegas Feb 3, 2017

Member

again, need to check for both bc we recycled the feature bit

@liewegas

Member

liewegas commented Feb 4, 2017

http://pulpito.ceph.com/sage-2017-02-03_23:05:46-rados-wip-pg-split-interval---basic-smithi/

the one (non-ntp) failure looks like the mon failing to resend a pg_create message. this code is about to be rewritten for luminous so i'm not sure it's that important.

@liewegas

Member

liewegas commented Feb 4, 2017

rebased (minor conflict with the ENXIO change)

liewegas added some commits Feb 1, 2017

osd/osd_types: add set_last_force_op_resend() accessor and use it
Signed-off-by: Sage Weil <sage@redhat.com>
osd/osd_types: last_force_op_resend -> last_force_op_resend_preluminous
Rename the current last_force_op_resend for legacy clients, and add a new
one that only applies to new clients that have the new
CEPH_FEATURE_OSD_NEW_INTERVAL_ON_SPLIT feature.

Signed-off-by: Sage Weil <sage@redhat.com>
osd/PG: discard ops based on either new or old lfor and features
If the client has the new feature bit, use the new field; if they have the
older feature bit, use the old field.

Note that there is no change to the Objecter: last_force_op_resend is
still the "current" field that it should pay attention to.

Signed-off-by: Sage Weil <sage@redhat.com>
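
The field selection described in the commit message above could look roughly like this. It is a sketch under assumptions: the feature bits are placeholders, and `PoolInfo` and `can_discard_op()` are hypothetical names rather than the actual osd/PG code.

```cpp
#include <cassert>
#include <cstdint>

using epoch_t = uint32_t;

// Placeholder feature bits for illustration only.
constexpr uint64_t FEATURE_NEW_RESEND = 1ull << 0;  // the "new" client bit
constexpr uint64_t FEATURE_OLD_RESEND = 1ull << 1;  // the "older" client bit

struct PoolInfo {
  epoch_t last_force_op_resend = 0;              // applies to new clients
  epoch_t last_force_op_resend_preluminous = 0;  // applies to legacy clients
};

// Hypothetical helper: an op stamped with map epoch `op_epoch` can be dropped
// if the pool forced a resend after that epoch, because the client is
// expected to send it again.
bool can_discard_op(const PoolInfo& pool, uint64_t client_features,
                    epoch_t op_epoch) {
  if (client_features & FEATURE_NEW_RESEND)
    return op_epoch < pool.last_force_op_resend;
  if (client_features & FEATURE_OLD_RESEND)
    return op_epoch < pool.last_force_op_resend_preluminous;
  return false;  // client understands neither field; never discard on this basis
}
```
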
mon/OSDMonitor: make pre-luminous clients resend ops on split
Pre-luminous clients do not understand that a split PG forms a new
interval.  Make them resend ops to work around this.

Signed-off-by: Sage Weil <sage@redhat.com>
@gregsfortytwo

Member

gregsfortytwo commented Feb 13, 2017

I reviewed from "osdc/Objecter: populate both actual pgid and full hash in MOSDOp" on in the faster dispatch PR and only had pretty minor comments. I'd missed some of the other bits you were ripping out though so I will take a look at this as well.

liewegas added some commits Feb 2, 2017

osdc/Objecter: resend ops on pg split if osd has CEPH_FEATURE_RESEND_ON_SPLIT

Signed-off-by: Sage Weil <sage@redhat.com>
osd/PG: discard ops from before the last split
New clients will resend.

Old clients will see a last_force_op_resend (now named
last_force_op_resend_preluminous in latest code) and resend.

We know this because we require that the monitors upgrade to luminous
before the OSDs, and the new mon code sets this field on split.

Signed-off-by: Sage Weil <sage@redhat.com>
osd/PG: fix tracking of last_epoch_split
Note that it is only (currently) important that this value be accurate
on the current OSD since we only use this value (currently) to discard
ops sent before the split.  If we are getting the history from a different
OSD in the cluster that doesn't have an up to date value it doesn't matter
because that implies a primary change and also a client resend.

Signed-off-by: Sage Weil <sage@redhat.com>
message/MOSDOp: build native hobject_t
Drop unneeded snapid_t snapid and object_locator_t, which just duplicate
hobject_t fields.

Signed-off-by: Sage Weil <sage@redhat.com>
osd: make use of MOSDOp::get_hobj()
Prefer this to get_object_locator() wherever possible.

Signed-off-by: Sage Weil <sage@redhat.com>
osd/OSDMap: generalize map_to_pg
So we can do this without constructing an object_locator_t.

Signed-off-by: Sage Weil <sage@redhat.com>
osdc/Objecter: remove reassert_version
We never populate this since we never get an ack.

Signed-off-by: Sage Weil <sage@redhat.com>
messages/MOSDOp: remove unused reassert_version
Signed-off-by: Sage Weil <sage@redhat.com>
messages/MOSDOp: add get_raw_pg()
Many current users expect a full hash value; make that explicit.

Signed-off-by: Sage Weil <sage@redhat.com>
messages/MOSDOp: new encoding w/ actual pgid separate from hobject hash
New clients will see an actual pgid as well as a full hash value in the
hobj.  Old clients will continue to see a single (full) hash value.

Signed-off-by: Sage Weil <sage@redhat.com>
osdc/Objecter: populate both actual pgid and full hash in MOSDOp
New clients need the actual pgid as well as the full hash (as part of the
target hobj).  Old clients only use the full hash value.  We need to pass
both to MOSDOp so it can encode based on the target features.

Signed-off-by: Sage Weil <sage@redhat.com>
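
To illustrate why the full hash and the actual pgid differ, here is a small sketch of the stable-modulo mapping from a full object hash to a placement seed. `stable_mod()` is modeled on Ceph's `ceph_stable_mod()`; `mask_for()` and `actual_seed()` are hypothetical wrappers added for the example.

```cpp
#include <cassert>
#include <cstdint>

// Modeled on ceph_stable_mod(): map x into [0, b) so that growing b moves
// only some inputs. bmask is the smallest (2^n - 1) that is >= b - 1.
inline uint32_t stable_mod(uint32_t x, uint32_t b, uint32_t bmask) {
  return ((x & bmask) < b) ? (x & bmask) : (x & (bmask >> 1));
}

// Smallest 2^n - 1 mask covering pg_num - 1 (assumes pg_num >= 1).
inline uint32_t mask_for(uint32_t pg_num) {
  uint32_t mask = 0;
  while (mask < pg_num - 1) mask = (mask << 1) | 1;
  return mask;
}

// The "actual" placement seed for a full object hash at a given pg_num.
inline uint32_t actual_seed(uint32_t full_hash, uint32_t pg_num) {
  return stable_mod(full_hash, pg_num, mask_for(pg_num));
}
```

With a full hash of 6, `actual_seed(6, 4)` is 2 and `actual_seed(6, 8)` is 6: the same hash lands in a child PG once pg_num grows, so an op addressed only by hash says nothing about which PG the client believed it was targeting.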
messages/MOSDOp: take spg_t, not pg_t, and drop old ctor
Signed-off-by: Sage Weil <sage@redhat.com>
osd: drop osd_debug_drop_op_probability
This is unused and not terribly useful.

Signed-off-by: Sage Weil <sage@redhat.com>
osd: make all fast dispatch ops MOSDFastDispatchOp children
Define common get_spg() and get_map_epoch() methods.

Signed-off-by: Sage Weil <sage@redhat.com>

liewegas added some commits Feb 7, 2017

osd: move internal in-memory types to osd_internal_types.h
Things like ObjectContext and lock state that are internal to the OSD
do not need to be in osd_types and shared with other parts of the code
base.

Notably, this fixes the problem with OpRequest needing things from
osd_types.h (osd_reqid_t for starters).  Others to follow.

Signed-off-by: Sage Weil <sage@redhat.com>
osd: remove MOSDPGMissing
Removed 7c414c5 (pre-bobtail).

Signed-off-by: Sage Weil <sage@redhat.com>
osd: explicitly enumerate ops we can dispatch
This prevents random messages from falling into an OpRequest and
dispatch_op().

Signed-off-by: Sage Weil <sage@redhat.com>
osd/PG: no need to split op waiting lists
Clients are now expected to resend on split, and there is already an
interval change.

Signed-off-by: Sage Weil <sage@redhat.com>
osd/OSDMap: is_acting_osd_shard
Signed-off-by: Sage Weil <sage@redhat.com>
osdc/Objecter: force pg_read ops to ignore cache overlay
pg_read is only used for PG listing and hit_set_{list,get}; these
operations can't and shouldn't consider the tiering overlay.

This makes the _calc_target behavior with the explicit pgid make sense;
otherwise, what would it mean to try to read pg x.1 from pool x and get
redirected to pg y.1 in pool y?

Signed-off-by: Sage Weil <sage@redhat.com>
osd/OSDMap: make is_acting_osd_shard an explicit spg_t check
Ensure that the ps value is < the pool pg_num.

Signed-off-by: Sage Weil <sage@redhat.com>
osdc/Objecter: use overlay pg_pool_t for subsequent calculations
We use pi for pg_num and other values below; we need to update accordingly
if we follow the overlay.

Signed-off-by: Sage Weil <sage@redhat.com>
osdc/Objecter: simplify pgid translation
All callers now pass in an explicit pgid, including pg listing.  Since
we resend ops on split, there is no need to do any translation here,
even for the jewel and kraken osds that can handle a full hash value.

Signed-off-by: Sage Weil <sage@redhat.com>
osdc/Objecter: recalculate target_* on every _calc_target call
Any time we are asked to calculate the target we should apply the
pool tiering parameters.  The previous logic of only doing so when the
target hadn't been calculated didn't make a whole lot of sense, and broke
our update of *pi that is needed to get the correct pg_num for the target
pool.  This didn't really matter for old clusters that take the raw pg,
but for luminous and beyond we need the exact spg_t which requires a
correct pg_num.

Signed-off-by: Sage Weil <sage@redhat.com>
messages/MOSDBackoff: add spg_t to message
and make it an MOSDFastDispatchOp.

Signed-off-by: Sage Weil <sage@redhat.com>
osdc/Objecter: manage backoffs per-spg_t
A backoff [range] is defined only within a specific spg_t; it does not
pass anything to children on split, or to another primary.

Signed-off-by: Sage Weil <sage@redhat.com>
osd: manage backoffs per-pg; drop special split logic
Switch backoffs to be owned by a specific spg_t.  Instead of wonky split
logic, just clear them.  This is mostly just for convenience; we could
conceivably only clear the range belonging to children (just to stay
tidy--we'll never get a request in that range) but why bother.

The full pg backoffs are still defined by the range for the pg, although
it's a bit redundant--we could just as easily do [min,max).  This way we
get readable hobject ranges in the messages that go by without having to
map to/from pgids.

Add Session::add_backoff() helper to keep Session internals out of PG.h.

Signed-off-by: Sage Weil <sage@redhat.com>
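
A rough sketch of backoffs owned by a specific PG, with split handled by simply clearing them. `SpgId`, `Backoff`, and `BackoffTable` are simplified, hypothetical stand-ins (strings replace hobject_t, and ranges within one PG are assumed not to overlap); this is not the OSD's data structure.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <tuple>

// Simplified stand-in for spg_t (pool, placement seed, shard).
struct SpgId {
  int64_t pool;
  uint32_t seed;
  int8_t shard;
  bool operator<(const SpgId& o) const {
    return std::tie(pool, seed, shard) < std::tie(o.pool, o.seed, o.shard);
  }
};

// A backoff covers the half-open object range [begin, end).
struct Backoff {
  std::string begin, end;
};

class BackoffTable {
  // spg -> (range begin -> backoff)
  std::map<SpgId, std::map<std::string, Backoff>> by_pg_;

public:
  void add(const SpgId& pg, const Backoff& b) { by_pg_[pg][b.begin] = b; }

  // True if `oid` falls inside a backoff registered for exactly this spg.
  bool blocked(const SpgId& pg, const std::string& oid) const {
    auto p = by_pg_.find(pg);
    if (p == by_pg_.end()) return false;
    auto q = p->second.upper_bound(oid);  // first range starting after oid
    if (q == p->second.begin()) return false;
    --q;                                  // last range starting at or before oid
    return oid < q->second.end;
  }

  // On split, drop everything for the PG; nothing is handed to children.
  void clear(const SpgId& pg) { by_pg_.erase(pg); }
};
```
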
osd: rename backoff config options
Signed-off-by: Sage Weil <sage@redhat.com>
osdc/Objecter: force pg_command ops to ignore overlay
Signed-off-by: Sage Weil <sage@redhat.com>
osd/Session: fix race between have_backoff() and clear_backoffs()
We may return a raw pointer that is about to get deallocated by
clear_backoffs().  Fix by returning a reference, preventing the free.

Signed-off-by: Sage Weil <sage@redhat.com>
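
The fix described above, handing back an owning reference instead of a raw pointer, can be illustrated with `std::shared_ptr`. This is only an analogy: the `Session` and `Backoff` types below are hypothetical simplifications and the real code uses its own reference type.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>

// Object range [begin, end); strings stand in for hobject_t.
struct Backoff {
  std::string begin, end;
};

class Session {
  mutable std::mutex lock_;
  std::map<std::string, std::shared_ptr<Backoff>> backoffs_;  // keyed by begin

public:
  void add_backoff(const std::string& begin, const std::string& end) {
    std::lock_guard<std::mutex> l(lock_);
    backoffs_[begin] = std::make_shared<Backoff>(Backoff{begin, end});
  }

  // Returns an owning reference taken under the lock, or nullptr. The caller's
  // copy keeps the object alive even if clear_backoffs() runs right after.
  std::shared_ptr<Backoff> have_backoff(const std::string& oid) const {
    std::lock_guard<std::mutex> l(lock_);
    auto p = backoffs_.upper_bound(oid);
    if (p == backoffs_.begin()) return nullptr;
    --p;
    if (oid < p->second->end) return p->second;
    return nullptr;
  }

  void clear_backoffs() {
    std::lock_guard<std::mutex> l(lock_);
    backoffs_.clear();
  }
};
```
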
osd: fix backoff vs reset race
In OSD::ms_handle_reset, we clear session->con before removing any
backoffs.  That means we have to check if con has been cleared after any
call to have_backoff, lest we race with ms_handle_reset and it removes the
backoffs but we don't realize our client session is disconnected.

Introduce a helper to do both these checks in a safe way, simplifying
callers while we're at it.

Signed-off-by: Sage Weil <sage@redhat.com>
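
One way to picture the helper described above: a single call that looks up the backoff and then checks whether the session is still connected, so callers cannot forget the second step. The sketch below is heavily simplified and hypothetical (`Verdict`, `check_backoff()`, and a boolean in place of the connection pointer); it is not the OSD's implementation.

```cpp
#include <cassert>
#include <mutex>
#include <set>
#include <string>

enum class Verdict { Proceed, Blocked, SessionDead };

class Session {
  std::mutex lock_;
  bool connected_ = true;           // stands in for session->con being set
  std::set<std::string> backoffs_;  // objects with an active backoff

public:
  void add_backoff(const std::string& oid) {
    std::lock_guard<std::mutex> l(lock_);
    backoffs_.insert(oid);
  }

  // Mirrors the ordering in the commit message: the connection is cleared
  // first, and the backoffs are removed afterwards.
  void handle_reset() {
    { std::lock_guard<std::mutex> l(lock_); connected_ = false; }
    { std::lock_guard<std::mutex> l(lock_); backoffs_.clear(); }
  }

  // Look up the backoff, then check the connection, so a concurrent reset
  // can never be mistaken for "no backoff, go ahead".
  Verdict check_backoff(const std::string& oid) {
    std::lock_guard<std::mutex> l(lock_);
    if (backoffs_.count(oid)) return Verdict::Blocked;
    if (!connected_) return Verdict::SessionDead;
    return Verdict::Proceed;
  }
};
```
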
@liewegas

Member

liewegas commented Feb 14, 2017

http://pulpito.ceph.com/sage-2017-02-14_06:55:13-rados-wip-pg-split-interval---basic-smithi/

failures unrelated:

  • misordered op bc pg log too short
  • tcmalloc vs dlopen deadlock
  • live obc after flush--saw this on master this morning too
  • clock sync
@liewegas

Member

liewegas commented Feb 14, 2017

retest this please

@gregsfortytwo

Member

gregsfortytwo commented Feb 14, 2017

How did you check that the pg log was too short? That's not generally an issue in the nightlies AFAIK and is exactly the kind of bug we'd see in this kind of PR...

@liewegas

Member

liewegas commented Feb 14, 2017

@gregsfortytwo

Sorry for the delay. This looks good to me, just a couple things.

case MSG_OSD_PG_INFO:
case MSG_OSD_PG_TRIM:
case MSG_OSD_BACKFILL_RESERVE:
case MSG_OSD_RECOVERY_RESERVE:

@gregsfortytwo

gregsfortytwo Feb 15, 2017

Member

Huh, does that mean every random message we've received has been grabbing an OSD lock and going through this process?

@liewegas

liewegas Feb 15, 2017

Member

No, an unrecognized message asserts in dispatch_op later; it just seemed better to enumerate them here (and not assert on a stray message type) than to assert in dispatch_op. (This after i spent several minutes trying to figure out which messages took this path.)

@gregsfortytwo

gregsfortytwo Feb 15, 2017

Member

I guess it's better torn iao fast, we'll just have to remember to add them to this enumeration whenever we make new message types (which IIRC is why it wasn't done before). Too bad we don't have a good way to switch on the new intermediate classes.
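
The trade-off in this thread (an explicit list that must be maintained, versus letting unknown types fall through) can be sketched as a switch with a rejecting default. The numeric values and `MSG_UNRELATED` are placeholders and `is_dispatchable()` is a hypothetical name; only the four message names come from the excerpt above.

```cpp
#include <cassert>
#include <cstdint>

namespace sketch {

// Placeholder numeric values; only the names mirror the excerpt above.
enum MsgType : uint16_t {
  MSG_OSD_PG_INFO = 1,
  MSG_OSD_PG_TRIM,
  MSG_OSD_BACKFILL_RESERVE,
  MSG_OSD_RECOVERY_RESERVE,
  MSG_UNRELATED,  // hypothetical type that this path does not handle
};

// Returns true only for message types this path knows how to dispatch.
// Anything not listed is rejected up front instead of failing later.
inline bool is_dispatchable(uint16_t type) {
  switch (type) {
  case MSG_OSD_PG_INFO:
  case MSG_OSD_PG_TRIM:
  case MSG_OSD_BACKFILL_RESERVE:
  case MSG_OSD_RECOVERY_RESERVE:
    return true;
  default:
    return false;
  }
}

}  // namespace sketch
```

The cost noted above still applies: every new message type has to be added to the list by hand.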

class MOSDFastDispatchOp : public Message {
public:
  virtual epoch_t get_map_epoch() const = 0;
  virtual spg_t get_spg() const = 0;

@gregsfortytwo

gregsfortytwo Feb 15, 2017

Member

This subclassing should also let us remove the templated replica_op_required_epoch() and handle_replica_op() functions in OSD.

@liewegas

liewegas Feb 15, 2017

Member

Yeah, the fast dispatch patch does exactly that
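
As a sketch of why a shared base class removes the need for templates here: once every op exposes the same virtual accessors, one ordinary function can serve all of them. The types below (`FastDispatchOp`, `RepOpSketch`, `ScanSketch`, `op_is_stale()`) are hypothetical stand-ins, not the real message classes or OSD functions.

```cpp
#include <cassert>
#include <cstdint>

using epoch_t = uint32_t;

// Simplified stand-in for spg_t.
struct SpgId {
  int64_t pool;
  uint32_t seed;
};

// Mirrors the shape of the interface in the excerpt above.
class FastDispatchOp {
public:
  virtual ~FastDispatchOp() = default;
  virtual epoch_t get_map_epoch() const = 0;
  virtual SpgId get_spg() const = 0;
};

// Two unrelated, hypothetical message types sharing the base.
class RepOpSketch : public FastDispatchOp {
  epoch_t e_;
  SpgId pg_;
public:
  RepOpSketch(epoch_t e, SpgId pg) : e_(e), pg_(pg) {}
  epoch_t get_map_epoch() const override { return e_; }
  SpgId get_spg() const override { return pg_; }
};

class ScanSketch : public FastDispatchOp {
  epoch_t e_;
  SpgId pg_;
public:
  ScanSketch(epoch_t e, SpgId pg) : e_(e), pg_(pg) {}
  epoch_t get_map_epoch() const override { return e_; }
  SpgId get_spg() const override { return pg_; }
};

// Non-templated: works for any subclass through the virtual accessors.
bool op_is_stale(const FastDispatchOp& op, epoch_t same_interval_since) {
  return op.get_map_epoch() < same_interval_since;
}
```
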

@@ -20,6 +20,7 @@
#include "common/hobject.h"
#include "osd/osd_types.h"
#include "osd/osd_internal_types.h"

@gregsfortytwo

gregsfortytwo Feb 15, 2017

Member

This seems like a pretty obtuse way to get osd_internal_types where it needs to go. Maybe it's included here by happenstance but shouldn't we place it in the right classes?

@liewegas

liewegas Feb 15, 2017

Member

It was just the first (and apparently only) place I hit a build error and added it as there weren't any further errors after that.

@gregsfortytwo

Member

gregsfortytwo commented Feb 15, 2017

Reviewed-by: Greg Farnum <gfarnum@redhat.com>

@jdurgin, I think this has covered all your concerns. Are we waiting on anything else?

@liewegas liewegas merged commit eb491a1 into ceph:master Feb 15, 2017

2 of 3 checks passed

default Build finished.
Details
Signed-off-by all commits in this PR are signed
Details
Unmodifed Submodules submodules for project are unmodified
Details

@liewegas liewegas deleted the liewegas:wip-pg-split-interval branch Feb 15, 2017
