DNM: osd: recovery optimization for overwrite Ops #7325
Changes from 17 commits
85fec3b
f776ffc
4e2ab34
bf3fdef
4b027a2
9775e39
6ed9180
09d57f8
e82cd5b
d82b5e1
e92d896
5a0b909
9af18cf
f841c2d
1364961
4068110
7dccede
c9b1edf
@@ -0,0 +1,121 @@
=======================
Partial Object Recovery
=======================

Partial Object Recovery aims to improve the efficiency of log-based
recovery, as opposed to backfill. The original log-based recovery
calculates the missing_set from the difference between pg_logs. The whole
object is recovered from one OSD to another whenever pg_log indicates
that the object was modified, regardless of how much of its content
actually changed. That means that for a 4M object with only 4k modified
inside, the whole 4M object is recovered rather than just the modified 4k
of content. In addition, the object map is recovered as well, even if it
was not modified at all.

Partial Object Recovery is designed to solve the problem described above.
To achieve this goal, two things need to be done:

1. logging where the object is modified
2. logging whether the object_map of an object is modified

The class ObjectCleanRegion is introduced for this purpose.
clean_offsets is a variable of type interval_set<uint64_t>
and is used to indicate the unmodified content of an object.
clean_omap is a bool indicating whether the object_map is clean.
max_num_intervals is an upper bound on the number of intervals in
clean_offsets, so that the memory cost of clean_offsets is always
bounded. The shortest clean interval is trimmed whenever the number of
intervals in clean_offsets exceeds this bound.
E.g., with max_num_intervals=2 and clean_offsets {[5~10], [20~5]},
a new interval [30~10] evicts the shortest one, [20~5],
and clean_offsets finally becomes {[5~10], [30~10]}.

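The eviction rule can be sketched as follows. This is a hypothetical
Python model, not the actual C++ ObjectCleanRegion (mark_clean and the
tuple representation are inventions for illustration); an interval
[off~len] is written as an (offset, length) tuple:

```python
# Toy model of ObjectCleanRegion's clean_offsets trimming.
# Intervals are (offset, length) tuples; all names are illustrative only.

def mark_clean(clean_offsets, interval, max_num_intervals=10):
    """Add a clean interval; evict the shortest one if the bound is exceeded."""
    clean_offsets = clean_offsets + [interval]
    if len(clean_offsets) > max_num_intervals:
        # trim the shortest clean interval so memory use stays bounded
        shortest = min(clean_offsets, key=lambda iv: iv[1])
        clean_offsets.remove(shortest)
    return clean_offsets

# e.g. max_num_intervals=2, clean_offsets {[5~10], [20~5]}:
offsets = [(5, 10), (20, 5)]
offsets = mark_clean(offsets, (30, 10), max_num_intervals=2)
print(offsets)   # [(5, 10), (30, 10)] -- [20~5] was evicted
```

The real structure also merges adjacent intervals; this sketch only shows
the bounded-memory eviction described above.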
Procedures for Partial Object Recovery
--------------------------------------

Firstly, OpContext and pg_log_entry_t should contain an
ObjectCleanRegion. In do_osd_ops(), finish_copyfrom(), and
finish_promote(), the corresponding content in the ObjectCleanRegion is
marked dirty, so that modifications of an object are traced. The
ObjectCleanRegion in the OpContext is also copied into its
pg_log_entry_t.

Secondly, the pg_missing_set must be built and rebuilt correctly.
When calculating the pg_missing_set during the peering process,
the ObjectCleanRegion of each pg_log_entry_t is also merged.
E.g., object aa has the pg_log:

  26'101 {[0~4096, 8192~MAX], false}
  26'104 {[0~8192, 12288~MAX], false}
  28'108 {[0~12288, 16384~MAX], true}

Merging the pg_log above yields the missing_set item for object aa:
{[0~4096, 16384~MAX], true}, which means that 4096~16384 is modified
and the object_map was also modified, as of version 28'108.

Also, the OSD may crash after merging the log. Therefore, we need to
read_log and rebuild the pg_missing_set. For example, the pg_log is:

  object aa: 26'101 {[0~4096, 8192~MAX], false}
  object bb: 26'102 {[0~4096, 8192~MAX], false}
  object cc: 26'103 {[0~4096, 8192~MAX], false}
  object aa: 26'104 {[0~8192, 12288~MAX], false}
  object dd: 26'105 {[0~4096, 8192~MAX], false}
  object aa: 28'108 {[0~12288, 16384~MAX], true}

Originally, if bb, cc, and dd were recovered but aa was not, we would
rebuild the pg_missing_set for object aa and find that aa was modified
at version 28'108. If the version in object_info is 26'96 < 28'108, we
would not need to consider 26'104 and 26'101, because the whole object
would be recovered. However, Partial Object Recovery also requires us
to rebuild the ObjectCleanRegion: knowing whether the object is
modified is not enough. Therefore, we also need to traverse the earlier
pg_log entries, that is, 26'104 and 26'101, which are also newer than
object_info (26'96), and rebuild the pg_missing_set item for object aa
from those three log entries: 28'108, 26'104, and 26'101.
The logs are merged in the same way as described above.

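Assuming that clean regions merge by intersecting the clean intervals
and OR-ing the omap-dirty flags (a reading of the example above, not
Ceph's actual implementation), the merge for object aa can be sketched
in Python, with MAX as a finite stand-in for the end of the object:

```python
# Sketch of merging per-entry clean regions into one missing_set item.
# A clean region is a sorted list of half-open (start, end) ranges;
# MAX is an illustrative stand-in for "end of object".

MAX = 1 << 22  # illustrative bound, not a Ceph constant

def intersect(a, b):
    """Intersect two sorted lists of half-open (start, end) ranges."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        # advance whichever range ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

# pg_log entries for object aa: clean ranges plus an omap-dirty flag
entries = [
    ([(0, 4096), (8192, MAX)], False),    # 26'101
    ([(0, 8192), (12288, MAX)], False),   # 26'104
    ([(0, 12288), (16384, MAX)], True),   # 28'108
]

clean, omap_dirty = entries[0]
for ranges, dirty in entries[1:]:
    clean = intersect(clean, ranges)
    omap_dirty = omap_dirty or dirty

# clean is now [(0, 4096), (16384, MAX)] and omap_dirty is True,
# matching the merged missing_set item {[0~4096, 16384~MAX], true}.
```

Everything outside the merged clean region, i.e. 4096~16384 here, must
be transmitted during recovery.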
Finally, the push and pull process is finished based on the
pg_missing_set. copy_subset in recovery_info is updated based on the
ObjectCleanRegion in the pg_missing_set; copy_subset indicates the
intervals of content that need to be pulled and pushed. The complicated
part here is submit_push_data, where several cases have to be
considered separately. What we need to consider is how to deal with the
object data, which consists of omap_header, xattrs, omap, and data:

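Under the assumption that copy_subset is simply the complement of the
clean intervals within [0, size), it can be derived with a minimal
sketch (dirty_subset is a hypothetical helper, not a Ceph function):

```python
def dirty_subset(clean, size):
    """Complement of sorted half-open clean ranges within [0, size)."""
    out, pos = [], 0
    for lo, hi in clean:
        if pos < lo:
            out.append((pos, lo))  # gap before this clean range is dirty
        pos = max(pos, hi)
    if pos < size:
        out.append((pos, size))    # dirty tail after the last clean range
    return out

# clean [(0, 4096), (16384, size)] -> pull/push only [4096, 16384)
print(dirty_subset([(0, 4096), (16384, 4194304)], 4194304))
# [(4096, 16384)]
```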
case 1 -- first && complete: since the object is recovered in a single
PushOp, we would like to preserve the original object and overwrite it
directly; the object is not removed and recreated.

issue 1: as the object is not removed, old xattrs remain in it but may
have been updated in the new version. Overwriting an existing key or
adding a new key is correct, but a removed key would survive. To solve
this issue, we need to remove all original xattrs of the object first,
and then set the new xattrs.

issue 2: as the object is not removed, the object_map may need to be
recovered, depending on clean_omap. Therefore, if the omap is being
recovered, we need to remove the old omap of the object for the same
reason, since an omap update may also be a deletion.

Thus, in this case, we should:

1) clear the xattrs of the object
2) clear the omap of the object if omap recovery is needed
3) truncate the object to recovery_info.size
4) recover the omap_header
5) recover the xattrs, and recover the omap if needed
6) punch zeros into the original object where fiemap reports no data
7) overwrite the object content that was modified
8) finish the recovery

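The steps above can be modeled with a toy in-memory object: a dict
standing in for omap_header, xattrs, omap, and data. All names are
illustrative, not Ceph's API, and the fiemap-based zero punching of
step 6 is omitted:

```python
# Toy model of case 1 (first && complete): overwrite the object in place.

def push_in_place(obj, push, recover_omap):
    obj["xattrs"] = {}                        # 1) clear xattrs (stale keys!)
    if recover_omap:
        obj["omap"] = {}                      # 2) clear omap (an update may
                                              #    also be a deletion)
    obj["data"] = obj["data"][:push["size"]].ljust(push["size"], b"\0")  # 3)
    obj["omap_header"] = push["omap_header"]  # 4) recover omap_header
    obj["xattrs"].update(push["xattrs"])      # 5) recover xattrs ...
    if recover_omap:
        obj["omap"].update(push["omap"])      #    ... and omap if needed
    # (step 6, punching zeros via fiemap, is omitted in this toy model)
    data = bytearray(obj["data"])
    for off, payload in push["extents"]:      # 7) overwrite modified content
        data[off:off + len(payload)] = payload
    obj["data"] = bytes(data)
    return obj                                # 8) recovery finished

obj = {"data": b"x" * 16, "xattrs": {"old_key": b"1"}, "omap": {},
       "omap_header": b""}
push = {"size": 16, "omap_header": b"hdr", "xattrs": {"k": b"v"},
        "omap": {}, "extents": [(4, b"NEW!")]}
obj = push_in_place(obj, push, recover_omap=False)
print(obj["xattrs"])   # {'k': b'v'} -- the stale 'old_key' is gone
print(obj["data"])     # b'xxxxNEW!xxxxxxxx'
```

Note how clearing the xattrs first (step 1) is what makes removed keys
disappear, which is exactly issue 1 above.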
case 2 -- first && !complete: the object is recovered over multiple
PushOps. Here, target_oid indicates a new temp object in pgid_TEMP,
so the issues are a bit different.

issue 1: as the object is newly created, there is no need to deal with
stale xattrs.

issue 2: as the object is newly created, the object_map may not be
transmitted, depending on clean_omap. Therefore, if clean_omap is true,
we need to clone the object_map from the original object.

issue 3: as the object is newly created, unmodified data is not
transmitted. Therefore, we need to clone the unmodified data from the
original object.

Thus, in this case, we should:

1) remove the temp object
2) create a new temp object
3) set alloc_hint for the new temp object
4) truncate the new temp object to recovery_info.size
5) recover the omap_header
6) clone the object_map from the original object if the omap is clean
7) clone the unmodified object data from the original object
8) punch zeros into the new temp object
9) recover the xattrs, and recover the omap if needed
10) overwrite the object content that was modified
11) remove the original object
12) move and rename the new temp object to replace the original object
13) finish the recovery
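Similarly, the temp-object path can be sketched with the same toy
object model (hypothetical names; alloc_hint, zero punching, and the
multi-PushOp bookkeeping are omitted):

```python
# Toy model of case 2 (first && !complete): build a temp object, clone
# the clean parts from the original, then replace the original with it.

def push_to_temp(store, oid, push, clean_offsets, omap_is_clean):
    # 1)-5) remove any old temp object and create a fresh one holding
    # the pushed omap_header (alloc_hint / zero punching omitted)
    temp = {"data": bytearray(b"\0" * push["size"]), "xattrs": {},
            "omap": {}, "omap_header": push["omap_header"]}
    orig = store[oid]
    if omap_is_clean:
        temp["omap"] = dict(orig["omap"])      # 6) clone the clean omap
    for lo, hi in clean_offsets:               # 7) clone unmodified data
        hi = min(hi, len(orig["data"]))
        temp["data"][lo:hi] = orig["data"][lo:hi]
    temp["xattrs"].update(push["xattrs"])      # 9) recover xattrs
    for off, payload in push["extents"]:       # 10) overwrite dirty content
        temp["data"][off:off + len(payload)] = payload
    temp["data"] = bytes(temp["data"])
    store[oid] = temp                          # 11)-12) replace the original
    return temp

store = {"aa": {"data": b"ABCDEFGH", "xattrs": {}, "omap": {"k": b"v"},
                "omap_header": b""}}
clean = [(0, 4)]                               # first 4 bytes unmodified
out = push_to_temp(store, "aa",
                   {"size": 8, "omap_header": b"h", "xattrs": {},
                    "extents": [(4, b"wxyz")]},
                   clean, omap_is_clean=True)
print(out["data"], out["omap"])  # b'ABCDwxyz' {'k': b'v'}
```

The clean bytes and the clean omap come from the local original object,
so they never cross the network, which is the point of the design.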
@@ -754,7 +754,8 @@ int DBObjectMap::get(const ghobject_t &oid,
  Header header = lookup_map_header(hl, oid);
  if (!header)
    return -ENOENT;
  _get_header(header, _header);
  if (_header)
    _get_header(header, _header);

Review comment: Separate commit.

  ObjectMapIterator iter = _get_iterator(header);
  for (iter->seek_to_first(); iter->valid(); iter->next()) {
    if (iter->status())
@@ -2667,31 +2667,33 @@ void FileStore::_do_transaction(

    case Transaction::OP_CLONERANGE:
      {

Review comment: Good catch, add tests for it.

Reply: test is added 5da53e4

        coll_t cid = i.get_cid(op->cid);
        coll_t cid, ncid;
        cid = ncid = i.get_cid(op->cid);
        ghobject_t oid = i.get_oid(op->oid);
        _kludge_temp_object_collection(cid, oid);
        ghobject_t noid = i.get_oid(op->dest_oid);
        _kludge_temp_object_collection(cid, noid);
        _kludge_temp_object_collection(ncid, noid);
        uint64_t off = op->off;
        uint64_t len = op->len;
        tracepoint(objectstore, clone_range_enter, osr_name, len);
        r = _clone_range(cid, oid, noid, off, len, off, spos);
        r = _clone_range(cid, ncid, oid, noid, off, len, off, spos);
        tracepoint(objectstore, clone_range_exit, r);
      }
      break;

    case Transaction::OP_CLONERANGE2:
      {
        coll_t cid = i.get_cid(op->cid);
        coll_t cid, ncid;
        cid = ncid = i.get_cid(op->cid);
        ghobject_t oid = i.get_oid(op->oid);
        _kludge_temp_object_collection(cid, oid);
        ghobject_t noid = i.get_oid(op->dest_oid);
        _kludge_temp_object_collection(cid, noid);
        _kludge_temp_object_collection(ncid, noid);
        uint64_t srcoff = op->off;
        uint64_t len = op->len;
        uint64_t dstoff = op->dest_off;
        tracepoint(objectstore, clone_range2_enter, osr_name, len);
        r = _clone_range(cid, oid, noid, srcoff, len, dstoff, spos);
        r = _clone_range(cid, ncid, oid, noid, srcoff, len, dstoff, spos);
        tracepoint(objectstore, clone_range2_exit, r);
      }
      break;
@@ -3709,13 +3711,13 @@ int FileStore::_do_copy_range(int from, int to, uint64_t srcoff, uint64_t len, u
  return r;
}

int FileStore::_clone_range(const coll_t& cid, const ghobject_t& oldoid, const ghobject_t& newoid,
int FileStore::_clone_range(const coll_t& cid, const coll_t& ncid, const ghobject_t& oldoid, const ghobject_t& newoid,
                            uint64_t srcoff, uint64_t len, uint64_t dstoff,
                            const SequencerPosition& spos)
{
  dout(15) << "clone_range " << cid << "/" << oldoid << " -> " << cid << "/" << newoid << " " << srcoff << "~" << len << " to " << dstoff << dendl;

  if (_check_replay_guard(cid, newoid, spos) < 0)
  if (_check_replay_guard(ncid, newoid, spos) < 0)
    return 0;

  int r;

@@ -3724,7 +3726,7 @@ int FileStore::_clone_range(const coll_t& cid, const ghobject_t& oldoid, const g
  if (r < 0) {
    goto out2;
  }
  r = lfn_open(cid, newoid, true, &n);
  r = lfn_open(ncid, newoid, true, &n);
  if (r < 0) {
    goto out;
  }

@@ -3741,7 +3743,7 @@ int FileStore::_clone_range(const coll_t& cid, const ghobject_t& oldoid, const g
 out:
  lfn_close(o);
 out2:
  dout(10) << "clone_range " << cid << "/" << oldoid << " -> " << cid << "/" << newoid << " "
  dout(10) << "clone_range " << cid << "/" << oldoid << " -> " << ncid << "/" << newoid << " "
           << srcoff << "~" << len << " to " << dstoff << " = " << r << dendl;
  return r;
}
@@ -1112,19 +1112,27 @@ struct PGLog : DoutPrefixProvider {
       << "," << info.last_update << "]" << dendl;

  set<hobject_t, hobject_t::BitwiseComparator> did;
  set<hobject_t, hobject_t::BitwiseComparator> del;
  set<hobject_t, hobject_t::BitwiseComparator> checked;
  for (list<pg_log_entry_t>::reverse_iterator i = log.log.rbegin();
       i != log.log.rend();
       ++i) {
    if (!debug_verify_stored_missing && i->version <= info.last_complete) break;
    if (cmp(i->soid, info.last_backfill, info.last_backfill_bitwise) > 0)
      continue;
    if (i->is_error())
      continue;
    // merge clean_regions in pg_log_entry_t if needed

Review thread:

  This logic only comes into play if we are upgrading from the old
  encoding, right? This doesn't seem worthwhile; just fill out the
  missing set entry with the worst case.

  @athanatos not absolutely. It is mainly used for the case where the
  OSD goes down after the new pg_log has been transmitted. In that
  case, we should find all the missing items based on the local
  pg_log, because the pg_log has been transmitted but the object has
  not been recovered yet.

  I don't think that's true. The recent missing_set changes always
  write down the missing items, right?

  @athanatos, yeah, you are right: current master changes
  read_log/write_log into read_log_and_missing/write_log_and_missing.
  So this is used to check whether our missing set is correct, as
  expected, based on the flag debug_verify_stored_missing. In that
  case, we also need to consider retrieving pg_missing_item from disk,
  as you mentioned before. I think we can add a tail to the key, such
  that if p->key().substr(0, 7) == string("missing") &&
  p->key().substr(7, 9) == string("v1") we decode the new version, and
  otherwise decode the old versions. Do you think this is reasonable?
  I also agree that it is much easier to handle the cases here if
  encode/decode is wrapped by the macro. But, as discussed before, is
  it worthwhile to change the interface of decode so that we can also
  pass the feature bits into the decode function?

    // there is no need to merge the items before object is deleted
    if (!del.count(i->soid))
      missing.merge(*i);
    if (i->version <= info.last_complete) continue;
    if (did.count(i->soid)) continue;
    did.insert(i->soid);

    if (i->is_delete()) continue;
    if (i->is_delete()) {
      del.insert(i->soid);
      continue;
    }

    bufferlist bv;
    int r = store->getattr(
@@ -1136,15 +1144,17 @@ struct PGLog : DoutPrefixProvider {
    object_info_t oi(bv);
    if (oi.version < i->version) {
      ldpp_dout(dpp, 15) << "read_log_and_missing missing " << *i
                         << " (have " << oi.version << ")" << dendl;
                         << " (have " << oi.version << ")"
                         << " clean_regions " << i->clean_regions << dendl;
      if (debug_verify_stored_missing) {
        auto miter = missing.get_items().find(i->soid);
        assert(miter != missing.get_items().end());
        assert(miter->second.need == i->version);
        assert(miter->second.have == oi.version);
        checked.insert(i->soid);
      } else {
        missing.add(i->soid, i->version, oi.version);
        missing.add(i->soid, i->version, oi.version, false);
        missing.merge(*i);
      }
    }
  } else {
Review thread:

  It is awkward to add a new interface clone_omap in ObjectStore.
  Furthermore, clone_omap would turn the original omap into a parent
  and generate two children. Even if the first object is deleted, the
  parent and its second child still exist until the parent's refcount
  drops to zero (i.e. until the cloned object is removed). So I would
  like to retransmit the omap content if recovery cannot be finished
  in one PushOp; is that reasonable? @athanatos

  Seems reasonable to me. In the common case, it won't be present if
  not modified anyway (either it's empty, or it's involved in pretty
  much every update).