DNM: osd: recovery optimization for overwrite Ops #7325
Conversation
@athanatos do you have high-level concerns before we look at this too closely?
Force-pushed from 8a5c25e to d4810e3
@liewegas, updated, please review again, thanks.
need = e.version;
can_recover_partial = (can_recover_partial && e.can_recover_partial);
dirty_extents.union_of(e.dirty_extents);
}
void encode(bufferlist& bl) const {
::encode(need, bl);
If we are changing the encoding, we need to use the ENCODE_START etc macros
@athanatos, the encoding of struct item is wrapped by the encoding of struct pg_missing_t, and pg_missing_t already uses the ENCODE_START/ENCODE_FINISH macros to handle compatibility.
So, the first thing I notice is that this branch appears to be buggy if you have, for example, an omap operation followed by a write in the librados operation (I think?). I also notice that an operation which sets a single xattr creates a log entry which cannot be partially recovered. Rather than tracking dirty regions, we probably want to be tracking regions unmodified by a particular log entry. This value would start at [0, MAX] and write operations would subtract from it. This should probably be wrapped in another structure on the OpContext and the log entry structure next to the ObjectModDesc. The drawback is that such a change does not allow us to simply write over the existing object, we'll have to atomically rename it out of the way, create a new object, and clone_range regions of the old file into the new one before removing the old one. This isn't particularly efficient with filestore, but even then, it avoids the network transfer. With bluestore, we'll be able to do it efficiently. It also has the advantage of letting us recover objects modified only by the snaptrimmer or scrub by sending only the xattrs.
We actually have objectstore operations for removing omap and xattr, so we don't need to do the rename->clone_range thing. Sage points out that we can avoid recovering the omap the same way if we add an omap_unchanged bool to the same structure as the clean_regions. I do also realize that tracking clean regions rather than dirty regions is not actually different. What is different is letting that structure determine the recovery operation without needing a separate can_recover_partial bool.
This is reasonable, as bluestore is the future. If the clone_range cost is acceptable for FileStore, this is the better choice. But I still think clean regions carry the same information as dirty regions. We can also avoid the can_recover_partial bool while using dirty regions, since the dirty regions are exactly the complement of the clean regions?
Yeah, that's what I meant by the end of my post. Mostly, I don't want to have the can_recover_partial bool.
assert(it != missing.missing.end());
can_recover_partial = it->second.can_recover_partial;
if (can_recover_partial)
data_subset.intersection_of(it->second.dirty_extents);
If can_recover_partial, we use dirty_extents, so we can't use "data_subset.subtract(cloning)" below, because dirty_extents has nothing to do with clone_overlap.
Yeah, so I may use subtract_of(cloning) instead of subtract, which only subtracts the intersection of cloning and data_subset.
When recovering partially, there is no need to use clone_subsets to do clone_range, is there?
So in my opinion, if can_recover_partial is true, we can skip the calculation of the clone_subsets and cloning part.
Yeah, you are right. Thanks for the reminder.
Consider an upgrade scenario where we need to upgrade to this can_recover_partial version:
Force-pushed from d4810e3 to b63468b
@sysnote So maybe two suggestions for this:
Or any other better suggestions?
I think in order for this to work we need to make sure that recovery does a single ObjectStore::Transaction on the object that brings it fully up to date with the latest object version. If that's not possible, we can fall back to normal full-object recovery. That means taking all of the previous updates to the object, combining the things that have changed, and applying all the changes at once. To do that we need to look backwards in the log by following prior_version across multiple hops. I'm just re-reviewing this from scratch, but it looks to me like dirty_extents is still possibly not enough information. At least, truncate isn't handled properly--it needs to dirty everything from new_size to old_size. And then on recovery, we need to ignore everything dirty beyond the final object size.
if (miter != missing.missing.end()) {
assert(did.count(i->soid));
miter->second.omap_unchanged = miter->second.omap_unchanged && i->omap_unchanged;
miter->second.dirty_extents.intersection_of(i->dirty_extents);
union_of
Currently dirty_extents indicates the object intervals that are not modified, so we need intersection_of here. I think the name dirty_extents is confusing here; a name like clean_extents would be better?
Ah, yeah... maybe unmodified_extents and unmodified_omap for consistency?
Ok, it looks like dirty_extents just isn't being set correctly for truncate operations, or things that implicitly truncate, like write_full or clone.
Ok, good point about looking backward in the log using prior_version.
interval_set<uint64_t> unmodified;
unmodified.insert(0, size);
unmodified.intersection_of(it->second.dirty_extents);
data_subset.subtract(unmodified);
Based on the current strategy, the interval (new_size ~ old_size) of a truncate operation might not be added to dirty_extents, since we always consider the data from new_size to old_size in the old object invalid based on its data_subset. Therefore, the invalid data will not be cloned to the recovered object, as expected.
This is still undergoing design changes, so I adjusted the title and added pending-discussion.
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Signed-off-by: yaoning <yaoning@unitedstack.com>
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Force-pushed from f8e3421 to 782250d
@athanatos, updated. I will squash the commits later; any other suggestions?
int r = store->stat(coll, ghobject_t(recovery_info.soid), &st);
if (recovery_info.object_exist) {
assert(r == 0);
uint64_t local_size = MIN(recovery_info.size, (uint64_t)st.st_size);
Hi @athanatos, we need the local object size to make sure the clone_range operation always clones content within range, so I think the stat is unavoidable here? Otherwise we would need to retrieve the local object_info to verify the size, which is similar to this one. Any other ideas? If so, I think it is more straightforward to check for -ENOENT to judge whether the object exists, so the bool object_exist in recovery_info is not needed?
Didn't the data included in the push depend on whether the object exists?
It can, but in order to resolve the issue (if the remote indicates that {0~1024, 10240~1024} need recovery, and a hole exists at {8192~1024} while the local object size is 8192, we need to skip the local clone for the holes), we can deal with it this way. Another option is calculating the clone_interval on the remote side and transmitting it inside recovery_info. I think you prefer the latter, right? I will modify it.
3) set alloc_hint for the new temp object
4) truncate new temp object to recovery_info.size
5) recovery omap_header
6) clone object_map from original object if omap is clean
It is awkward to add a new clone_omap interface to ObjectStore. Furthermore, clone_omap would turn the original omap into a parent and generate two children; even if the first object is deleted, the parent and its second child still exist until the parent refcount drops to zero (i.e. the cloned object is removed).
So I would like to retransmit the omap content if recovery cannot be finished in one PushOp. Is that reasonable?
@athanatos
Seems reasonable to me. In the common case, it won't be present if not modified anyway (either it's empty, or it's involved in pretty much every update).
} else {
// If omap is not changed, we still need to recover omap when recovery cannot be completed in one pass
if (progress.first && progress.omap_complete)
new_progress.omap_complete = false;
The implementation is here. If the omap is unchanged and recovery cannot finish in one PushOp, we transmit the omap again in the next PushOp. I think this is the simplest fix for the problem, right? @athanatos
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
Force-pushed from 782250d to 7dccede
if (cmp(i->soid, info.last_backfill, info.last_backfill_bitwise) > 0)
continue;
if (i->is_error())
continue;
// merge clean_regions in pg_log_entry_t if needed
This logic only comes into play if we are upgrading from the old encoding, right? This doesn't seem worthwhile, just fill out the missing set entry with the worst case.
@athanatos Not absolutely. It is mainly for the case where the OSD goes down after the new pg_log has been transmitted. In that case, we should find all missing_items based on the local pg_log, because the pg_log has been transmitted but the objects have not been recovered yet.
I don't think that's true. The recent missing_set changes always write down the missing items right?
@athanatos, urh, yeah, you are right; current master changed read_log/write_log to read_log_and_missing/write_log_and_missing. So it is used to check whether our missing set is correct, as expected, based on the debug_verify_stored_missing flag.
Also, this way we need to consider retrieving pg_missing_item from disk, as you mentioned before. I think we can add a tag to the key, such that if p->key().substr(0, 7) == string("missing") && p->key().substr(7, 2) == string("v1"), we decode the new version, else we decode the old versions. Do you think this is reasonable?
Also, I agree it is much easier to handle these cases if encode/decode is wrapped by the macro. But as discussed before, is it worthwhile to change the decode interface so that we can also pass the feature bits into the decode function?
data_subset.intersection_of(it->second.clean_regions.get_dirty_regions());
dout(10) << "calc_head_subsets " << head
<< " data_subset " << data_subset << dendl;
return;
Why is this skipping the snapshot logic below?
@athanatos If the head object is modified, that implicitly indicates the head content differs from the snapshot content, so there is no chance to copy the unmodified content from the snapshot. Furthermore, I think we should replace clone_overlap in the snapset with a bounded_lossy_interval_set, as you mentioned before? Currently we find that taking snapshots sometimes produces too many items in clone_overlap, which leads to a quite large snapset attr (this ruins performance when rebuilding the ObjectContext).
I agree to the second part. I don't understand what you mean by the first part.
@athanatos dirty_region indicates where the head object was modified, right? Don't you think the already-modified content must be non-overlapping with any snap object? Nothing calculated from clone_subsets can cover it, so why do we use it?
That's not really how this works. Between version v (on the recovery target) and version v' (the current version), we might have created a clone at version c. Some of the dirty_regions between v and v' might already exist on that clone on the recovery target (because we already recovered it). Thus, we'd use clone_subsets for those. It may not be necessary, but you're creating a special case here, and that seems worse to me. Does your code not work properly if clone_subsets is non-empty?
Oh, yeah, I got it. Thanks, will update later.
uint64_t z_offset = pop.before_progress.data_recovered_to;
uint64_t z_length = pop.after_progress.data_recovered_to - pop.before_progress.data_recovered_to;
if (z_length)
data_zeros.insert(z_offset, z_length);
I don't understand what this is doing. Couldn't this region contain data which we want to preserve?
I recalculate the data_zeros in submit_push_data. Here we just bound the data_zeros interval and indicate its possible regions.
I don't really understand what you mean by that. If this region doesn't necessarily include zeroes, then why is it called data_zeroes?
@athanatos Oh, yeah. So should we call it hole_intervals or something else?
@@ -5458,6 +5465,7 @@ int ReplicatedPG::do_osd_ops(OpContext *ctx, vector<OSDOp>& ops)
write_update_size_and_usage(ctx->delta_stats, oi, ctx->modified_ranges,
op.clonerange.offset, op.clonerange.length, false);
ctx->clean_regions.mark_data_region_dirty(op.clonerange.offset, op.clonerange.length);
Why can't we infer this from ctx->modified_ranges?
@athanatos modified_ranges is not accurate in some cases, such as truncate, copy_from, and finish promote. So you prefer to correct ctx->modified_ranges in those cases and infer the clean_regions once from modified_ranges? If so, we also need to add an interface mark_data_region_dirty(interval_set<uint64_t> dirty_region), and furthermore we would need an extra iteration over ctx->modified_ranges.
If ctx->modified_ranges isn't correct for those cases, we have a larger problem since we use it for recovery... Yeah, you should definitely fix those cases and use ctx->modified_ranges.
ok
}
uint64_t off = 0;
uint32_t fadvise_flags = CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL;
if (cache_dont_need)
fadvise_flags |= CEPH_OSD_OP_FLAG_FADVISE_DONTNEED;

// Punch zeros for data, if fiemap indicates nothing but it is marked dirty
Here is the real logic to punch zeros: copy_subset - intervals_included = fiemap_holes. I think we can transmit it on the wire so that it is easier and more straightforward to do? It can also be used for local clones.
We probably don't need to transmit it if we can compute it from copy_subset and intervals_included (which are transmitted, right?)
Signed-off-by: Ning Yao <yaoning@unitedstack.com>
feel free to re-open once you have time to address Sam's comments
@jdurgin, do you think it is time for us to reopen it now?
@mslovy I guess it depends on "if you have time to address Sam's comments".
This PR is based on #3837
Introduce a can_recover_partial bit to make sure that only overwrite ops can be recovered as partial content.
Otherwise, run recovery as the normal process.
Signed-off-by: Ning Yao <yaoning@unitedstack.com>