
DNM: osd: recovery optimization for overwrite Ops #7325

Closed

wanted to merge 18 commits

Conversation

mslovy
Contributor

@mslovy mslovy commented Jan 22, 2016

This PR is based on #3837
It introduces a can_recover_partial bit to ensure that only overwrite Ops are recovered as partial content; otherwise, recovery runs as the normal full-object process.

Signed-off-by: Ning Yao <yaoning@unitedstack.com>

@mslovy mslovy changed the title Wip new recovery osd: recovery optimization for overwrite Ops Jan 22, 2016
@liewegas
Member

@athanatos do you have high-level concerns before we look at this too closely?

@mslovy
Contributor Author

mslovy commented Jan 29, 2016

@liewegas, updated, please review again, thanks.

need = e.version;
can_recover_partial = (can_recover_partial && e.can_recover_partial);
dirty_extents.union_of(e.dirty_extents);
}
void encode(bufferlist& bl) const {
::encode(need, bl);
Contributor

If we are changing the encoding, we need to use the ENCODE_START etc macros

Contributor Author

@athanatos, the encoding of struct item is wrapped by the encoding of struct pg_missing_t, and pg_missing_t already uses the ENCODE_START/ENCODE_FINISH macros to handle compatibility.

@athanatos
Contributor

So, the first thing I notice is that this branch appears to be buggy if you have, for example, an omap operation followed by a write in the librados operation (I think?). I also notice that an operation which sets a single xattr creates a log entry which cannot be partially recovered.

Rather than tracking dirty regions, we probably want to be tracking regions unmodified by a particular log entry. This value would start at [0, MAX] and write operations would subtract from it. This should probably be wrapped in another structure on the OpContext and the log entry structure next to the ObjectModDesc.

The drawback is that such a change does not allow us to simply write over the existing object, we'll have to atomically rename it out of the way, create a new object, and clone_range regions of the old file into the new one before removing the old one. This isn't particularly efficient with filestore, but even then, it avoids the network transfer. With bluestore, we'll be able to do it efficiently. It also has the advantage of letting us recover objects modified only by the snaptrimmer or scrub by sending only the xattrs.
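The clean-regions idea above can be sketched roughly as follows. This is a hedged illustration with made-up names (CleanRegions, mark_data_region_dirty), not Ceph's actual interval_set API: the unmodified set starts at [0, MAX) and every write subtracts its extent.

```cpp
#include <cstdint>
#include <iterator>
#include <limits>
#include <map>

// Illustrative sketch of tracking regions unmodified by log entries.
struct CleanRegions {
  std::map<uint64_t, uint64_t> clean;  // disjoint half-open intervals: start -> end

  CleanRegions() { clean[0] = std::numeric_limits<uint64_t>::max(); }

  // A write to [off, off+len) removes that range from the clean set.
  void mark_data_region_dirty(uint64_t off, uint64_t len) {
    uint64_t end = off + len;
    auto it = clean.lower_bound(off);
    if (it != clean.begin() && std::prev(it)->second > off)
      --it;  // the previous interval overlaps [off, end)
    while (it != clean.end() && it->first < end) {
      uint64_t s = it->first, e = it->second;
      it = clean.erase(it);
      if (s < off) clean[s] = off;  // keep the clean prefix
      if (e > end) clean[end] = e;  // keep the clean suffix
    }
  }

  bool is_clean(uint64_t off) const {
    auto it = clean.upper_bound(off);
    return it != clean.begin() && std::prev(it)->second > off;
  }
};
```

A recovery push would then transfer only the complement of `clean` up to the object size.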

@athanatos
Contributor

We actually have objectstore operations for removing omap and xattr, so we don't need to do the rename->clone_range thing.

Sage points out that we can avoid recovering the omap the same way if we add an omap_unchanged bool to the same structure as the clean_regions.

I do also realize that tracking clean regions rather than dirty regions is not actually different. What is different is letting that structure determine the recovery operation without needing a separate can_recover_partial bool.

@mslovy
Contributor Author

mslovy commented Feb 3, 2016

This is reasonable, as bluestore is the future. If the copy_range cost is acceptable for FileStore, this is the better choice. But I still think clean regions are effectively the same as dirty regions: we could also avoid the can_recover_partial bool by using dirty regions, since the dirty regions are exactly the complement of the clean regions?

@athanatos
Contributor

Yeah, that's what I meant by the end of my post. Mostly, I don't want to have the can_recover_partial bool.

assert(it != missing.missing.end());
can_recover_partial = it->second.can_recover_partial;
if (can_recover_partial)
data_subset.intersection_of(it->second.dirty_extents);
Contributor

If can_recover_partial, we use dirty_extents, but then we can't use data_subset.subtract(cloning) in the following code, because dirty_extents has nothing to do with clone_overlap.

Contributor Author

Yeah, so I may use subtract_of(cloning) instead of subtract, which only subtracts the intersection of cloning and data_subset.

Contributor

When recovering partially, there is no need to use clone_subsets to do clone_range, is there?
So in my opinion, if can_recover_partial is true, we can skip the calculation of clone_subsets and the cloning part.

Contributor Author

Yeah, you are right, thanks for the reminder.

@sysnote
Contributor

sysnote commented Feb 26, 2016

Consider an upgrade scenario where we need to upgrade to this can_recover_partial version, e.g. for pg 3.67 on [0, 1, 2]:
1) First we update osd.0 (service ceph restart osd.0); it recovers normally and everything goes on.
2) A write req (e.g. req1, which writes to obj1) is sent to the primary (osd.0), and the pglog records it.
3) Then we update osd.1; req1 fails to reach osd.1 but is sent to osd.2. While osd.2 is handling the req (in do_request), pg 3.67 starts peering, and osd.2's can_discard_request decides that req1 should be dropped.
4) So req1 only succeeds on osd.0; because min_size=2, osd.0 re-enqueues req1.
5) During peering, the primary finds that req1's object obj1 is missing on osd.1 and osd.2, so it recovers the object.
6) Because osd.0 and osd.1 are already updated, osd.0 calculates partial data in prep_push_to_replica, and osd.1 can handle the partial data correctly.
7) But osd.2 has not been updated; in osd.2's code path (submit_push_data), it removes the original object first and then writes the partial data from osd.0, so the original data of the object is lost.

@mslovy
Contributor Author

mslovy commented Apr 19, 2016

@sysnote
@liewegas , improvements based on @athanatos's suggestions and the compatibility issues, such as the omap_unchanged bit and rebuilding dirty_extents in read_log().
One remaining issue still needs to be resolved here. Say object A has two modifications, 239'241 and 239'243. Currently only 239'243 is kept in rmissing, so last_complete will move to 239'241 for non-partial recovery. That is correct if we always recover the whole object. However, if we recover partial extents, last_complete cannot move to 239'242, since the log entry 239'241 must still be considered in read_log.

So maybe two suggestions for this:

  1. treat item.need as a map<eversion_t> and record all needed versions in missing as well;
  Or 2. use some bits to tell which kinds of recovery should recover the whole object.

Or any other better suggestions?

@liewegas
Member

I think in order for this to work we need to make sure that recovery does a single ObjectStore::Transaction on the object that brings it fully up to date with the latest version of the object. If that's not possible, we can fall back to normal full-object recovery. That means taking all of the previous updates to the object, combining the things that have changed, and applying all the changes at once. To do that we need to look backwards in the log, following prior_version for multiple hops.

I'm just re-reviewing this from scratch, but it looks to me like dirty_extents is still possibly not enough information. At least, truncate isn't handled properly: it needs to dirty everything from new_size to old_size. And then on recovery, we need to ignore everything dirty beyond the final object size.
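The two points above, folding the changes of several log entries into one set and ignoring dirty data beyond the final size after a truncate, can be sketched as follows. Types and names are illustrative, not Ceph's interval_set:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

using Extent = std::pair<uint64_t, uint64_t>;  // {offset, length}

// Union the dirty extents of every log entry between the recovery
// target's version and head (walked via prior_version in the real code),
// clamped to the final object size: a truncate dirties [new_size,
// old_size), but data past the final size need not be pushed.
std::vector<Extent> combine_dirty(std::vector<Extent> dirty,
                                  uint64_t final_size) {
  std::sort(dirty.begin(), dirty.end());
  std::vector<Extent> out;
  for (auto [off, len] : dirty) {
    uint64_t end = std::min(off + len, final_size);
    if (off >= end)
      continue;  // extent lies entirely beyond the final size
    if (!out.empty() && out.back().first + out.back().second >= off) {
      // overlaps or abuts the previous extent: merge them
      uint64_t e = std::max(out.back().first + out.back().second, end);
      out.back().second = e - out.back().first;
    } else {
      out.push_back({off, end - off});
    }
  }
  return out;
}
```

Everything outside the combined set could then be cloned locally, and everything inside it pushed, in one transaction.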

if (miter != missing.missing.end()) {
assert(did.count(i->soid));
miter->second.omap_unchanged = miter->second.omap_unchanged && i->omap_unchanged;
miter->second.dirty_extents.intersection_of(i->dirty_extents);
Member

union_of

Contributor Author

Currently dirty_extents indicates the object intervals which are not modified; thus we need intersection_of here. I think the name dirty_extents is confusing here. Would a name like clean_extents be better?

Member

Ah, yeah... maybe unmodified_extents and unmodified_omap for consistency?

@liewegas
Member

Ok, it looks like dirty_extents just isn't being set correctly for truncate operations. Or things that implicitly truncate, like write_full, or clone.

@mslovy
Contributor Author

mslovy commented Apr 26, 2016

Ok, good point for looking backward in the log using its prior_version.

interval_set<uint64_t> unmodified;
unmodified.insert(0, size);
unmodified.intersection_of(it->second.dirty_extents);
data_subset.subtract(unmodified);
Contributor Author

@mslovy mslovy Apr 26, 2016

Based on the current strategy, the interval (new_size ~ old_size) of a truncate operation might not be added to dirty_extents, since we always consider the data from new_size to old_size in the old object invalid based on its data_subset. Therefore, the invalid data will not be cloned into the recovered object, as expected.

@athanatos athanatos changed the title osd: recovery optimization for overwrite Ops DNM: osd: recovery optimization for overwrite Ops May 4, 2016
@athanatos
Contributor

This is still undergoing design changes, so I adjusted the title and added pending-discussion.

yaoning added 5 commits September 19, 2016 10:37
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Signed-off-by: yaoning <yaoning@unitedstack.com>
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
@mslovy
Contributor Author

mslovy commented Sep 19, 2016

@athanatos , updated. I will squash the commits later; any other suggestions?

int r = store->stat(coll, ghobject_t(recovery_info.soid), &st);
if (recovery_info.object_exist) {
assert(r == 0);
uint64_t local_size = MIN(recovery_info.size, (uint64_t)st.st_size);
Contributor Author

Hi @athanatos, we need the local object size to make sure the clone_range operation always clones content within range, so I think the stat is unavoidable here? Otherwise we would need to retrieve the local object_info to verify the size, which amounts to the same thing. Any other ideas? If so, I think it is more straightforward to check for -ENOENT to judge whether the object exists, so the bool object_exist in recovery_info would not be needed?

Contributor

Didn't the data included in the push depend on whether the object exists?

Contributor Author

It can, but it is there to resolve this issue: if the remote indicates {0~1024, 10240~1024} needs recovery, while a hole {8192~1024} exists and the local object size is 8192, we need to skip the local clone for the hole. This way we can deal with it. Another option is to calculate the clone interval on the remote side and transmit it inside recovery_info. I think you prefer the latter, right? I will modify it.
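The clamp described here can be sketched as follows; a hedged illustration (names like clamp_to_local are made up, and a plain vector stands in for Ceph's interval_set):

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

using Extent = std::pair<uint64_t, uint64_t>;  // {offset, length}

// Restrict the extents we plan to clone_range from the local object to
// [0, local_size): anything past the local EOF is a hole and has to be
// filled with zeros or pushed data instead of being cloned.
std::vector<Extent> clamp_to_local(const std::vector<Extent>& want,
                                   uint64_t local_size) {
  std::vector<Extent> ok;
  for (auto [off, len] : want) {
    if (off >= local_size)
      continue;  // entirely beyond local EOF: nothing to clone
    ok.push_back({off, std::min(off + len, local_size) - off});
  }
  return ok;
}
```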

3) set alloc_hint for the new temp object
4) truncate new temp object to recovery_info.size
5) recovery omap_header
6) clone object_map from original object if omap is clean
Contributor Author

@mslovy mslovy Sep 19, 2016

It is awkward to add a new clone_omap interface to ObjectStore. Furthermore, clone_omap would turn the original omap into a parent with two children; even if the first object is deleted, the parent and its second child still exist until the parent's refcount drops to zero (i.e. the cloned object is removed).
So I would like to retransmit the omap content if recovery cannot finish in one PushOp. Is that reasonable?
@athanatos

Contributor

Seems reasonable to me. In the common case, it won't be present if not modified anyway (either it's empty, or it's involved in pretty much every update).

} else {
// If omap is not changed, we need recovery omap when recovery cannot be completed once
if (progress.first && progress.omap_complete)
new_progress.omap_complete = false;
Contributor Author

@mslovy mslovy Sep 19, 2016

The implementation is here: if the omap is unchanged and recovery cannot finish in one PushOp, we transmit the omap again in the next PushOp. I think this is the simplest solution to the problem, right? @athanatos

Ning Yao added 7 commits September 19, 2016 14:28
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Signed-off-by: Ning Yao <yaoning@unitedstack.com>

if (cmp(i->soid, info.last_backfill, info.last_backfill_bitwise) > 0)
continue;
if (i->is_error())
continue;
// merge clean_regions in pg_log_entry_t if needed
Contributor

This logic only comes into play if we are upgrading from the old encoding, right? This doesn't seem worthwhile, just fill out the missing set entry with the worst case.

Contributor Author

@athanatos not necessarily. It is mainly for the case where the osd goes down after the new pg_log has been transmitted. In that case we should work out all the missing items from the local pg_log, because the pg_log was transmitted but the objects have not been recovered yet.

Contributor

I don't think that's true. The recent missing_set changes always write down the missing items right?

Contributor Author

@athanatos , yeah, you are right; current master changed read_log/write_log to read_log_and_missing/write_log_and_missing. So this is used to check whether our missing set is as expected, guarded by the debug_verify_stored_missing flag.

Also, this way we need to consider retrieving pg_missing_item from disk, as you mentioned before. I think we can add a tag to the key, so that if p->key().substr(0, 7) == string("missing") && p->key().substr(7, 9) == string("v1") we decode the new version, and otherwise we decode the old version. Do you think this is reasonable?

I also agree that the cases here would be much easier to handle if encode/decode were wrapped by the macros. But as discussed before, is it worthwhile to change the decode interface so that we can also pass the feature bits into the decode function?
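The key-tagging proposal amounts to a prefix dispatch like the sketch below. The exact prefixes and the enum are hypothetical, for illustration only, not Ceph's actual key scheme:

```cpp
#include <string>

// Hypothetical on-disk key tagging for missing entries: legacy keys keep
// the bare "missing" prefix, new-format keys add a version tag.
enum class MissingFmt { Legacy, V1 };

MissingFmt missing_key_format(const std::string& key) {
  static const std::string tagged = "missing_v1";
  if (key.compare(0, tagged.size(), tagged) == 0)
    return MissingFmt::V1;     // decode the new pg_missing_item encoding
  return MissingFmt::Legacy;   // decode the old eversion_t-only encoding
}
```

New entries would be written under the tagged prefix while the decoder keeps accepting both, which is the compatibility point being discussed.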

data_subset.intersection_of(it->second.clean_regions.get_dirty_regions());
dout(10) << "calc_head_subsets " << head
<< " data_subset " << data_subset << dendl;
return;
Contributor

Why is this skipping the snapshot logic below?

Contributor Author

@athanatos If the head object is modified, that implicitly means the head content differs from the snapshot content, so there is no chance to copy the unmodified content from a snapshot. Furthermore, I think we should replace the clone_overlap in snapset with a bounded_lossy_interval_set, as you mentioned before? We currently find that taking snapshots sometimes creates too many items in clone_overlap, which leads to a very large snapset attr (and that ruins performance when rebuilding the ObjectContext).

Contributor

I agree to the second part. I don't understand what you mean by the first part.

Contributor Author

@athanatos dirty_regions indicates where the head object has been modified, right? Don't you think the already-modified content must be non-overlapping with any snap object? Then nothing calculated from clone_subsets can help us, so why do we use it?

Contributor

That's not really how this works. Between version v (on the recovery target) and version v' (the current version), we might have created a clone at version c. Some of the dirty_regions between v and v' might already exist on that clone on the recovery target (because we already recovered it). Thus, we'd use clone_subsets for those. It may not be necessary, but you're creating a special case here, and that seems worse to me. Does your code not work properly if clone_subsets is non-empty?

Contributor Author

Oh, yeah, I got it. Thanks, will update later.


uint64_t z_offset = pop.before_progress.data_recovered_to;
uint64_t z_length = pop.after_progress.data_recovered_to - pop.before_progress.data_recovered_to;
if(z_length)
data_zeros.insert(z_offset, z_length);
Contributor

I don't understand what this is doing. Couldn't this region contain data which we want to preserve?

Contributor Author

I recalculate data_zeros in submit_push_data. Here I just bound the data_zeros interval and indicate its possible regions.

Contributor

I don't really understand what you mean by that. If this region doesn't necessarily include zeroes, then why is it called data_zeroes?

Contributor Author

@athanatos oh, yeah. So shall we call it hole_intervals or something else?

@@ -5458,6 +5465,7 @@ int ReplicatedPG::do_osd_ops(OpContext *ctx, vector<OSDOp>& ops)

write_update_size_and_usage(ctx->delta_stats, oi, ctx->modified_ranges,
op.clonerange.offset, op.clonerange.length, false);
ctx->clean_regions.mark_data_region_dirty(op.clonerange.offset, op.clonerange.length);
Contributor

Why can't we infer this from ctx->modified_ranges?

Contributor Author

@athanatos modified_ranges is not accurate in some cases, such as truncate, copy_from and finish_promote. So you would prefer to correct ctx->modified_ranges in those cases and infer clean_regions from modified_ranges once? If so, we also need to add an interface mark_data_region_dirty(interval_set<uint64_t> dirty_region), and we need an extra iteration over ctx->modified_ranges.

Contributor

If ctx->modified_ranges isn't correct for those cases, we have a larger problem since we use it for recovery... Yeah, you should definitely fix those cases and use ctx->modified_ranges.

Contributor Author

ok

}
uint64_t off = 0;
uint32_t fadvise_flags = CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL;
if (cache_dont_need)
fadvise_flags |= CEPH_OSD_OP_FLAG_FADVISE_DONTNEED;

// Punch zeros for data, if fiemap indicates nothing but it is marked dirty
Contributor Author

Here is the real logic for punching zeros: copy_subset - intervals_included = fiemap_holes. I think we could transmit it on the wire so that this is easier and more straightforward? It could also be used for the local clones?

Contributor

We probably don't need to transmit it if we can compute it from copy_subset and intervals_included (which are transmitted, right?)
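The subtraction being discussed, copy_subset minus intervals_included, can be sketched as follows. A hedged illustration: the name fiemap_holes and the plain-vector representation are made up, not Ceph's interval_set:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

using Extent = std::pair<uint64_t, uint64_t>;  // {offset, length}

// The parts of copy_subset that fiemap reported no data for; the target
// must punch zeros there rather than expect written data. Both inputs
// are assumed sorted and disjoint.
std::vector<Extent> fiemap_holes(const std::vector<Extent>& copy_subset,
                                 const std::vector<Extent>& included) {
  std::vector<Extent> out;
  for (auto [off, len] : copy_subset) {
    uint64_t cur = off, end = off + len;
    for (auto [ioff, ilen] : included) {
      uint64_t iend = ioff + ilen;
      if (iend <= cur || ioff >= end)
        continue;  // no overlap with the remaining gap
      if (ioff > cur)
        out.push_back({cur, ioff - cur});  // hole before this data run
      cur = std::max(cur, iend);
    }
    if (cur < end)
      out.push_back({cur, end - cur});     // trailing hole
  }
  return out;
}
```

Since both operands already travel on the wire, the receiver can compute this locally, which is the point made in the reply above.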

Signed-off-by: Ning Yao <yaoning@unitedstack.com>
@ghost

ghost commented Nov 23, 2016

feel free to re-open once you have time to address Sam's comments

@ghost ghost closed this Nov 23, 2016
@mslovy
Contributor Author

mslovy commented Apr 4, 2017

@jdurgin, do you think it is time for us to reopen it now?

@tchaikov
Contributor

tchaikov commented Apr 4, 2017

@mslovy I guess it depends on "if you have time to address Sam's comments".

This pull request was closed.
7 participants