DNM: osd: recovery optimization for overwrite Ops #7325
Conversation
@athanatos do you have high-level concerns before we look at this too closely?
Force-pushed from 8a5c25e to d4810e3
@liewegas, updated, please review again, thanks.
need = e.version;
can_recover_partial = (can_recover_partial && e.can_recover_partial);
dirty_extents.union_of(e.dirty_extents);
}
void encode(bufferlist& bl) const {
::encode(need, bl);
If we are changing the encoding, we need to use the ENCODE_START etc macros
@athanatos, the encoding of struct item is wrapped by the encoding of struct pg_missing_t, and pg_missing_t already uses the ENCODE_START/ENCODE_FINISH macros to handle compatibility.
So, the first thing I notice is that this branch appears to be buggy if you have, for example, an omap operation followed by a write in the librados operation (I think?). I also notice that an operation which sets a single xattr creates a log entry which cannot be partially recovered. Rather than tracking dirty regions, we probably want to be tracking regions unmodified by a particular log entry. This value would start at [0, MAX] and write operations would subtract from it. This should probably be wrapped in another structure on the OpContext and the log entry structure next to the ObjectModDesc. The drawback is that such a change does not allow us to simply write over the existing object, we'll have to atomically rename it out of the way, create a new object, and clone_range regions of the old file into the new one before removing the old one. This isn't particularly efficient with filestore, but even then, it avoids the network transfer. With bluestore, we'll be able to do it efficiently. It also has the advantage of letting us recover objects modified only by the snaptrimmer or scrub by sending only the xattrs.
We actually have objectstore operations for removing omap and xattr, so we don't need to do the rename->clone_range thing. Sage points out that we can avoid recovering the omap the same way if we add an omap_unchanged bool to the same structure as the clean_regions. I do also realize that tracking clean regions rather than dirty regions is not actually different. What is different is letting that structure determine the recovery operation without needing a separate can_recover_partial bool.
This is reasonable, as bluestore is the future. If the clone_range cost is acceptable for FileStore, this is the better choice. But I still think clean regions carry the same information as dirty regions. We can also avoid the can_recover_partial bool while using dirty regions, since the dirty regions are exactly the complement of the clean regions?
Yeah, that's what I meant by the end of my post. Mostly, I don't want to have the can_recover_partial bool.
assert(it != missing.missing.end());
can_recover_partial = it->second.can_recover_partial;
if (can_recover_partial)
data_subset.intersection_of(it->second.dirty_extents);
If can_recover_partial, we use dirty_extents, so we can't use "data_subset.subtract(cloning)" below, because dirty_extents has nothing to do with clone_overlap.
Yeah, so I may use subtract_of(cloning) instead of subtract, which only subtracts the intersection of cloning and data_subset.
When recovering partially, there is no need to use clone_subsets to do clone_range, is there?
So in my opinion, if can_recover_partial is true, we can skip the calculation of the clone_subsets and cloning part.
Yeah, you are right. Thanks for the reminder.
Consider an upgrade scenario where we need to upgrade to this can_recover_partial version:
Force-pushed from d4810e3 to b63468b
@sysnote So maybe two suggestions for this:
Or any other better suggestions?
I think in order for this to work we need to make sure that recovery does a single ObjectStore::Transaction on the object that brings it fully up to date with the latest object version. If that's not possible, we can fall back to normal full-object recovery. That means taking all of the previous updates to the object, combining the things that have changed, and applying all the changes at once. To do that we need to look backwards in the log by following prior_version across multiple hops. I'm just re-reviewing this from scratch, but it looks to me like dirty_extents is still possibly not enough information. At least, truncate isn't handled properly--it needs to dirty everything from new_size to old_size. And then on recovery, we need to ignore everything dirty beyond the final object size.
if (miter != missing.missing.end()) {
assert(did.count(i->soid));
miter->second.omap_unchanged = miter->second.omap_unchanged && i->omap_unchanged;
miter->second.dirty_extents.intersection_of(i->dirty_extents);
union_of
Currently dirty_extents indicates the object intervals that are not modified, so we need intersection_of here. I think the name dirty_extents is confusing here; a name like clean_extents would be better?
Ah, yeah... maybe unmodified_extents and unmodified_omap for consistency?
Ok, it looks like dirty_extents just isn't being set correctly for truncate operations, or things that implicitly truncate, like write_full or clone.
Ok, good point about looking backward in the log using prior_version.
interval_set<uint64_t> unmodified;
unmodified.insert(0, size);
unmodified.intersection_of(it->second.dirty_extents);
data_subset.subtract(unmodified);
Based on the current strategy, the interval (new_size ~ old_size) of a truncate operation might not be added to dirty_extents, since we always consider the data from new_size to old_size in the old object invalid based on its data_subset. Therefore, the invalid data will not be cloned to the recovered object, as expected.
This is still undergoing design changes, so I adjusted the title and added pending-discussion.
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Signed-off-by: yaoning <yaoning@unitedstack.com>
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Force-pushed from f8e3421 to 782250d
@athanatos, updated. I will squash the commits later; any other suggestions?
int r = store->stat(coll, ghobject_t(recovery_info.soid), &st);
if (recovery_info.object_exist) {
assert(r == 0);
uint64_t local_size = MIN(recovery_info.size, (uint64_t)st.st_size);
Hi @athanatos, we need the local object size to make sure the clone_range operation always clones content within range, so I think the stat is unavoidable here? Otherwise we would need to retrieve the local object_info to verify the size, which is similar to this one. Any other ideas? If so, I think it is more straightforward to check for -ENOENT to judge whether the object exists, so the bool object_exist in recovery_info is not needed?
Didn't the data included in the push depend on whether the object exists?
It can, but in order to resolve the issue (if the remote indicates that {0~1024, 10240~1024} need recovery, and a hole exists at {8192~1024} while the local object size is 8192, we need to skip the local clone for the holes), we can deal with it this way. Another option is calculating the clone_interval on the remote side and transmitting it inside recovery_info. I think you prefer the latter, right? I will modify it.
3) set alloc_hint for the new temp object
4) truncate new temp object to recovery_info.size
5) recovery omap_header
6) clone object_map from original object if omap is clean
It is awkward to add a new clone_omap interface to ObjectStore. Furthermore, clone_omap would turn the original omap into a parent and generate two children; even if the first object is deleted, the parent and its second child still exist until the parent refcount drops to zero (i.e. the cloned object is removed).
So I would like to retransmit the omap content if recovery cannot be finished in one PushOp. Is that reasonable?
@athanatos
Seems reasonable to me. In the common case, it won't be present if not modified anyway (either it's empty, or it's involved in pretty much every update).
} else {
// If omap is not changed, we still need to recover omap when recovery cannot be completed in one pass
if (progress.first && progress.omap_complete)
new_progress.omap_complete = false;
The implementation is here. If the omap is unchanged and recovery cannot finish in one PushOp, we transmit the omap again in the next PushOp. I think this is the simplest fix for the problem, right? @athanatos
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Force-pushed from 782250d to 7dccede
if (cmp(i->soid, info.last_backfill, info.last_backfill_bitwise) > 0)
continue;
if (i->is_error())
continue;
// merge clean_regions in pg_log_entry_t if needed
This logic only comes into play if we are upgrading from the old encoding, right? This doesn't seem worthwhile, just fill out the missing set entry with the worst case.
@athanatos Not absolutely. It is mainly for the case where the OSD goes down after the new pg_log has been transmitted. In that case, we should find all missing_items based on the local pg_log, because the pg_log has been transmitted but the objects have not been recovered yet.
I don't think that's true. The recent missing_set changes always write down the missing items right?
@athanatos, urh, yeah, you are right; current master changed read_log/write_log to read_log_and_missing/write_log_and_missing. So it is used to check whether our missing set is correct, as expected, based on the debug_verify_stored_missing flag.
Also, this way we need to consider retrieving pg_missing_item from disk, as you mentioned before. I think we can add a tag to the key, such that if p->key().substr(0, 7) == string("missing") && p->key().substr(7, 2) == string("v1"), we decode the new version, else we decode the old versions. Do you think this is reasonable?
Also, I agree it is much easier to handle these cases if encode/decode is wrapped by the macro. But as discussed before, is it worthwhile to change the decode interface so that we can also pass the feature bits into the decode function?
data_subset.intersection_of(it->second.clean_regions.get_dirty_regions());
dout(10) << "calc_head_subsets " << head
<< " data_subset " << data_subset << dendl;
return;
Why is this skipping the snapshot logic below?
@athanatos If the head object is modified, that implicitly indicates the head content differs from the snapshot content, so there is no chance to copy the unmodified content from the snapshot. Furthermore, I think we should replace clone_overlap in the snapset with a bounded_lossy_interval_set, as you mentioned before? Currently we find that taking snapshots sometimes produces too many items in clone_overlap, which leads to a quite large snapset attr (this ruins performance when rebuilding the ObjectContext).
I agree to the second part. I don't understand what you mean by the first part.
@athanatos dirty_region indicates where the head object was modified, right? Don't you think the already-modified content must be non-overlapping with any snap object? Nothing calculated from clone_subsets can cover it, so why do we use it?
That's not really how this works. Between version v (on the recovery target) and version v' (the current version), we might have created a clone at version c. Some of the dirty_regions between v and v' might already exist on that clone on the recovery target (because we already recovered it). Thus, we'd use clone_subsets for those. It may not be necessary, but you're creating a special case here, and that seems worse to me. Does your code not work properly if clone_subsets is non-empty?
Oh, yeah, I got it. Thanks, will update later.
uint64_t z_offset = pop.before_progress.data_recovered_to;
uint64_t z_length = pop.after_progress.data_recovered_to - pop.before_progress.data_recovered_to;
if (z_length)
data_zeros.insert(z_offset, z_length);
I don't understand what this is doing. Couldn't this region contain data which we want to preserve?
I recalculate the data_zeros in submit_push_data. Here we just bound the data_zeros interval and indicate its possible regions.
I don't really understand what you mean by that. If this region doesn't necessarily include zeroes, then why is it called data_zeroes?
@athanatos Oh, yeah. So should we call it hole_intervals or something else?
@@ -5458,6 +5465,7 @@ int ReplicatedPG::do_osd_ops(OpContext *ctx, vector<OSDOp>& ops)
write_update_size_and_usage(ctx->delta_stats, oi, ctx->modified_ranges,
op.clonerange.offset, op.clonerange.length, false);
ctx->clean_regions.mark_data_region_dirty(op.clonerange.offset, op.clonerange.length);
Why can't we infer this from ctx->modified_ranges?
@athanatos modified_ranges is not accurate in some cases, such as truncate, copy_from, and finish promote. So you prefer to correct ctx->modified_ranges in those cases and infer the clean_regions once from modified_ranges? If so, we also need to add an interface mark_data_region_dirty(interval_set<uint64_t> dirty_region), and furthermore we would need an extra iteration over ctx->modified_ranges.
If ctx->modified_ranges isn't correct for those cases, we have a larger problem since we use it for recovery... Yeah, you should definitely fix those cases and use ctx->modified_ranges.
ok
}
uint64_t off = 0;
uint32_t fadvise_flags = CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL;
if (cache_dont_need)
fadvise_flags |= CEPH_OSD_OP_FLAG_FADVISE_DONTNEED;

// Punch zeros for data, if fiemap indicates nothing but it is marked dirty
Here is the real logic to punch zeros: copy_subset - intervals_included = fiemap_holes. I think we can transmit it on the wire so that it is easier and more straightforward to do? It can also be used for local clones.
We probably don't need to transmit it if we can compute it from copy_subset and intervals_included (which are transmitted, right?)
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
feel free to re-open once you have time to address Sam's comments
@jdurgin, do you think it is time for us to reopen it now?
@mslovy I guess it depends on "if you have time to address Sam's comments".
This PR is based on #3837
Introduce a can_recover_partial bit to make sure that only overwrite ops can be recovered as partial content.
Otherwise, run recovery as the normal process.
Signed-off-by: Ning Yao <yaoning@unitedstack.com>