osd: process deletes during recovery instead of peering #15952

Merged
merged 20 commits into ceph:master from jdurgin:wip-peering-deletes on Jul 19, 2017

Conversation

4 participants
jdurgin (Member) commented Jun 27, 2017

With a large number of deletes in a client workload, performing them during peering can easily saturate a disk and cause very high latency, since it does not go through the op queue or do any batching.

To fix this, make recovery handle these deletes instead of performing them during peering.

Fixes: http://tracker.ceph.com/issues/19971

jdurgin (Member) commented Jun 27, 2017

retest this please

liewegas (Member) commented:

This looks good! I think we need the min_epoch pieces, or else we may get odd ordering issues between messages: everything else sets min_epoch, and if these messages don't, they may get blocked when the others aren't. (It's also faster.)

pg_shard_t from;
spg_t pgid; ///< target spg_t
epoch_t map_epoch;

liewegas (Member) commented Jun 27, 2017:
  epoch_t map_epoch, min_epoch;
...
  epoch_t get_map_epoch() const override {
    return map_epoch;
  }
  epoch_t get_min_epoch() const override {
    return min_epoch;
  }
pg_shard_t from;
spg_t pgid;
epoch_t map_epoch;

liewegas (Member) commented Jun 27, 2017:

should include min_epoch here too

@@ -3836,15 +3871,33 @@ class pg_missing_set : public pg_missing_const_i {
}
void encode(bufferlist &bl) const {
-    ENCODE_START(3, 2, bl);
+    ENCODE_START(4, 2, bl);

liewegas (Member) commented Jun 27, 2017:

If we can get a feature bit there, I would instead switch up the encoding so that the flags byte is encoded adjacent to each entry. It will be painful this release, but as soon as luminous is out we can rip out the compat code.

if (!needs_recovery(hoid))
return true;
if (is_deleted(hoid))
return false;

liewegas (Member) commented Jun 27, 2017:

it seems like a deleted object should be readable (in the sense that no recovery is needed?). I guess that is another condition to check in the io path... this just makes reads on deleted objects block until the delete happens?

jdurgin (Member) commented Jun 27, 2017:

I figured the conservative path would be simpler here - treating deletes more like other writes, and not requiring any additional blocking or shenanigans with ObjectContext in each path that tries to use this object.

Like you said, it just blocks reads on objects that were deleted until the delete occurs, which should not be a major part of many workloads and didn't seem worth optimizing for. This is less strict than the current code, which would end up blocking all i/o for all deletes due to being part of peering.

reply->from = get_parent()->whoami_shard();
reply->set_priority(m->get_priority());
reply->pgid = spg_t(get_parent()->get_info().pgid.pgid, m->from.shard);
reply->map_epoch = m->map_epoch;

liewegas (Member) commented Jun 27, 2017:

min_epoch = ... everywhere too

jdurgin (Member) commented Jun 27, 2017:

good point on min_epoch - I didn't realize almost every FastDispatchOp used it that way - sounds like it could be refactored into the base class later

@@ -31,6 +31,9 @@ struct MOSDPGRecoveryDelete : public MOSDFastDispatchOp {
epoch_t get_map_epoch() const override {
return map_epoch;
}
epoch_t get_min_epoch() const override {
return min_epoch;
}

liewegas (Member) commented Jun 28, 2017:

needs to be encoded/decoded too

list<pair<hobject_t, eversion_t> > objects;
epoch_t get_map_epoch() const override {
return map_epoch;
}
epoch_t get_min_epoch() const override {
return min_epoch;
}

liewegas (Member) commented Jun 28, 2017:

needs to be encoded/decoded too

@@ -75,6 +76,7 @@ void PGBackend::send_recovery_deletes(int prio,
new MOSDPGRecoveryDelete(get_parent()->whoami_shard(),
target_pg,
get_osdmap()->get_epoch());
msg->min_epoch = min_epoch;

liewegas (Member) commented Jun 28, 2017:

maybe make it a ctor arg so that it isn't ever forgotten

epoch_t min_epoch = get_parent()->get_last_peering_reset_epoch();
for (const auto& p : deletes) {
auto shard = p.first;
auto objects = p.second;

yuyuyu101 (Member) commented Jun 28, 2017:

I'm not sure: will the type of 'objects' here be vector<>&?

jdurgin (Member) commented Jun 28, 2017:

you mean it should be auto&? probably

liewegas modified the milestone: luminous, Jul 6, 2017

@@ -8276,6 +8284,14 @@ bool OSDMonitor::prepare_command_impl(MonOpRequestRef op,
err = -EPERM;
goto reply;
}
} else if (key == "recovery_deletes") {
if (osdmap.get_up_osd_features() & CEPH_FEATURE_OSD_RECOVERY_DELETES) {

liewegas (Member) commented Jul 13, 2017:

best to use the HAVE_FEATURE macro here

jdurgin (Member) commented Jul 13, 2017

can clean up several (maybe all) of the 'deletes_during_peering' params now that that's part of the missing set

jdurgin added some commits Jun 24, 2017

message, osd: add request/response messages for deletes during recovery
The existing BackfillRemove message has no reply, and PushOps have too
much logic that would need changing to accommodate deletions.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
osd: add a 'delete' flag to missing items and related functions
This will track deletes that were in the pg log and still need to be
performed during recovery. Note that with these deleted objects we may
not have an accurate 'have' version, since the object may have already
been deleted locally, so tolerate this when examining divergent entries.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
osd/PG: handle deletes in MissingLoc
There's no source needed for deleting an object, so don't keep track
of this. Update is_readable_with_acting/is_unfound, and add an
is_deleted() method to be used later.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
osd/PrimaryLogPG,PGBackend: handle deletes during recovery
Deletes are the same for EC and replicated pools, so add logic for
handling MOSDPGRecoveryDelete[Reply] to the base PGBackend
class.

Within PrimaryLogPG, add parallel paths for starting deletes,
recover_missing() and prep_object_replica_deletes(), and update the
local and global recovery callbacks to deal with lacking an
ObjectContext after a delete has been performed.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
osd/PrimaryLogPG: check whether clones missing from the cache are recovering

This appears now that deletes are not processed inline from the PG log
- a clone that is missing only on a peer (due to being deleted) would
not stop rollback from promoting the clone, resulting in hitting an
assert on the replica when the promotion tried to write to the missing
object on the replica.

This only affects cache tiering due to the dependence on the
MAP_SNAP_CLONE flag in find_object_context() - missing_oid was not being checked for being
recovered, unlike the target oid for the op (in do_op()).

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
osd/PGLog: client deletes are now part of the missing set
Signed-off-by: Josh Durgin <jdurgin@redhat.com>
osd/PGLog.h: update missing set verification for deletes
Deleted objects may still be on-disk after merging a log that includes
deletes, so adjust the asserts accordingly.

A case like:

980'1192 (972'1186) modify foo
--- osd restart ---
999'1196 (980'1192) delete foo
1003'1199 (0'0) modify foo
1015'1208 (1003'1199) delete foo

Would trigger the assert(miter->second.have == oi.info) since the
'have' version would be reset to 0'0.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
osd/PGLog.h: handle lost+delete entries the same as client deletes
Signed-off-by: Josh Durgin <jdurgin@redhat.com>

@jdurgin jdurgin changed the title from DNM: osd: process deletes during recovery instead of peering to osd: process deletes during recovery instead of peering Jul 17, 2017

jdurgin (Member) commented Jul 17, 2017

Rebased and fixed up the remaining lost+unfound failures. I think this is ready to go now.

jdurgin added some commits Jun 27, 2017

osd/PGLog: reset complete_to when appending lost_delete entries
Since lost_deletes queue recovery directly, and don't go through
activate_not_complete(), our complete_to iterator may still point at
log.end() (a list iterator pointing to .end() will still point to
.end() after a push_back()). Reset it to point before these new
lost_delete entries. This is needed now that lost_deletes are
performed during recovery, instead of inline when merging logs.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
osd/PGBackend: include min_epoch in RecoveryDelete messages
This matches ordering with other recovery messages, and may speed up
processing a bit.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
include/ceph_features.h: add feature bit for handling deletes during recovery

The BLKIN feature bit was actually unused - it was a remnant from
earlier versions of the blkin work, but all the encoding is handled by
struct-level versioning in the version that merged.

Use bit 60 (unused in any prior version) so that recovery deletes
could potentially be backported.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
osd: add incompat superblock feature for deletes during recovery
On-disk missing sets would need to be regenerated if downgraded from
luminous to kraken.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>

jdurgin added some commits Jul 4, 2017

osd_types, PGLog: encode missing based on features
Store whether the missing set should contain deletes, so that
persisted versions can be rebuilt if needed. Make missing_item
versioned, since it's persisted by the pg_log as an individual omap
value.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
osd_types, Objecter: make recovery_deletes feature create a new interval
This is needed to create a single place to regenerate the missing set
- at the start of a new interval where support for recovery deletes
changed.

The missing set is otherwise not cleared, so it would need to be
rebuilt in arbitrary places if e.g. an osd not supporting it went down
and restarted with support, or if we used a feature flag command to
trigger rebuilding the missing set.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
OSDMap, OSDMonitor: add flag for all osds supporting recovery deletes
Just like sortbitwise, this can only be toggled on, and once on osds
that do not support it are not allowed to boot.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
OSDMap, OSDMonitor: automatically set recovery deletes for luminous
Once the required osd release is luminous, all osds must support
recovery deletes, so set the flag then. This avoids an extra manual
step in luminous upgrades.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
TestPGLog: add unit tests for rebuilding missing set
Signed-off-by: Josh Durgin <jdurgin@redhat.com>
PGLog, PrimaryLogPG: rebuild the missing set when the OSDMap flag is set
The recovery_deletes flag will only be set once, by the 'ceph osd
require-osd-release luminous' command.

This matches how we rebuild the missing set when reading it off disk in
read_log_and_missing().

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
osd/PrimaryLogPG: guard lost_delete missing_loc change by feature flag
With deletes during recovery instead of during log processing, we need
to keep the entry in missing_loc.

Signed-off-by: Josh Durgin <jdurgin@redhat.com>
test: add a couple lost+delete unit tests
Signed-off-by: Josh Durgin <jdurgin@redhat.com>

yuriw merged commit c968a3d into ceph:master Jul 19, 2017

4 checks passed

Signed-off-by: all commits in this PR are signed
Unmodified Submodules: submodules for project are unmodified
make check: make check succeeded
make check (arm64): make check succeeded

jdurgin deleted the jdurgin:wip-peering-deletes branch Jul 20, 2017
