DNM: osd: recovery optimization for overwrite Ops #7325

Closed · wants to merge 18 commits · Changes from 17 commits
121 changes: 121 additions & 0 deletions doc/dev/osd_internals/partial_object_recovery.rst
@@ -0,0 +1,121 @@
=======================
Partial Object Recovery
=======================

Partial Object Recovery aims to improve the efficiency of log-based
recovery rather than backfill. The original log-based recovery
calculates the missing_set based on the difference between pg_logs.
The whole object is recovered from one OSD to another whenever pg_log
indicates that the object was modified, regardless of how much of the
object's content was really modified. That means a 4M object with only
4k modified inside is recovered as the whole 4M object rather than
just the modified 4k of content. In addition, the object map is also
recovered even if it was not modified at all.

Partial Object Recovery is designed to solve the problem mentioned
above. In order to achieve that goal, two things are necessary:
1. logging where the object is modified
2. logging whether the object_map of an object is modified

The class ObjectCleanRegion is introduced for this purpose.
clean_offsets is an interval_set<uint64_t> used to indicate the
unmodified content in an object. clean_omap is a bool indicating
whether the object_map is modified. max_num_intervals is an upper
bound on the number of intervals in clean_offsets, so that the memory
cost of clean_offsets is always bounded. If the number of intervals
in clean_offsets exceeds that bound, the shortest clean interval is
trimmed. E.g. with max_num_intervals=2 and clean_offsets={[5~10],
[20~5]}, a new interval [30~10] evicts the shortest one, [20~5], and
clean_offsets finally becomes {[5~10], [30~10]}.
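
A minimal sketch of the trimming rule, assuming a plain std::map in
place of interval_set (the names mirror the text above; the PR's real
ObjectCleanRegion is richer, with encoding and merge support)::

    #include <cstdint>
    #include <map>

    // Sketch only: models the bounded-interval behaviour described above.
    struct ObjectCleanRegionSketch {
      // start -> length of each clean (unmodified) extent
      std::map<uint64_t, uint64_t> clean_offsets;
      bool clean_omap = true;         // object_map untouched so far
      size_t max_num_intervals = 10;  // bounds the memory cost

      void add_clean(uint64_t off, uint64_t len) {
        clean_offsets[off] = len;     // overlap handling omitted for brevity
        // Trim: evict the shortest clean interval while over the bound.
        while (clean_offsets.size() > max_num_intervals) {
          auto shortest = clean_offsets.begin();
          for (auto it = clean_offsets.begin(); it != clean_offsets.end(); ++it)
            if (it->second < shortest->second)
              shortest = it;
          clean_offsets.erase(shortest);
        }
      }
    };

With max_num_intervals=2 and clean_offsets={[5~10], [20~5]},
add_clean(30, 10) evicts [20~5], leaving {[5~10], [30~10]} as in the
example above.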


Procedures for Partial Object Recovery
=======================================

Firstly, OpContext and pg_log_entry_t should contain an
ObjectCleanRegion. In do_osd_ops(), finish_copyfrom() and
finish_promote(), the corresponding content in the ObjectCleanRegion
is marked dirty, so that the modification of an object is traced. The
ObjectCleanRegion in the OpContext is then carried over into its
pg_log_entry_t.

Secondly, the pg_missing_set must be built and rebuilt correctly.
When calculating the pg_missing_set during the peering process, the
ObjectCleanRegion in each pg_log_entry_t is also merged. E.g. object
aa has pg_log:
26'101 {[0~4096, 8192~MAX], false}
26'104 {[0~8192, 12288~MAX], false}
28'108 {[0~12288, 16384~MAX], true}
missing_set for object aa: merging the pg_log above --> {[0~4096, 16384~MAX], true},
which means 4096~16384 is modified and the object_map is also
modified as of version 28'108.
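
A hedged sketch of that merge rule (the a~b tuples above are
offset~length; the sketch uses half-open [start, end) intervals, and
the bool mirrors the flag printed in the tuples)::

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Sketch only: folding several pg_log entries for one object into a
    // missing_set entry. Intervals are half-open: [start, end).
    struct Interval { uint64_t start, end; };

    // Content is clean only if *every* newer entry left it clean, so the
    // merged clean set is the intersection of the per-entry clean sets.
    std::vector<Interval> intersect(const std::vector<Interval>& a,
                                    const std::vector<Interval>& b) {
      std::vector<Interval> out;
      for (const auto& x : a)
        for (const auto& y : b) {
          uint64_t s = std::max(x.start, y.start);
          uint64_t e = std::min(x.end, y.end);
          if (s < e) out.push_back({s, e});
        }
      return out;
    }

    struct CleanRegionEntry {
      std::vector<Interval> clean;  // clean extents of one log entry
      bool omap_dirty;              // the flag printed in the tuples above
    };

    CleanRegionEntry merge(const CleanRegionEntry& a,
                           const CleanRegionEntry& b) {
      // The omap must be recovered if *any* entry modified it.
      return {intersect(a.clean, b.clean), a.omap_dirty || b.omap_dirty};
    }

Folding the three entries above this way yields clean extents
{[0~4096], [16384~MAX]} and omap_dirty=true, i.e. 4096~16384 plus the
object_map must be recovered.
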
Also, the OSD may crash after merging the log. Therefore, we need to
read_log and rebuild the pg_missing_set. For example, the pg_log is:
object aa: 26'101 {[0~4096, 8192~MAX], false}
object bb: 26'102 {[0~4096, 8192~MAX], false}
object cc: 26'103 {[0~4096, 8192~MAX], false}
object aa: 26'104 {[0~8192, 12288~MAX], false}
object dd: 26'105 {[0~4096, 8192~MAX], false}
object aa: 28'108 {[0~12288, 16384~MAX], true}
Suppose bb, cc and dd have been recovered, but aa has not. We need to
rebuild the pg_missing_set for object aa and find that aa was modified
at version 28'108. Originally, since the version in object_info is
26'96 < 28'108, we would not need to consider 26'104 and 26'101,
because the whole object would be recovered. However, Partial Object
Recovery also requires us to rebuild the ObjectCleanRegion; knowing
whether the object is modified is not enough. Therefore, we also need
to traverse the earlier pg_log entries: 26'104 and 26'101 are also >
object_info (26'96), so the pg_missing_set for object aa is rebuilt
from those three log entries: 28'108, 26'104 and 26'101. The logs are
merged in the same way as described above.
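
A hedged sketch of that rebuild (reusing CleanRegionEntry and merge()
from the sketch above; version is a plain integer standing in for
eversion_t, e.g. 26'96)::

    #include <string>
    #include <vector>

    // Sketch only: rebuild one object's missing entry from the local log.
    struct LogEntrySketch {
      std::string oid;
      uint64_t version;
      CleanRegionEntry regions;
    };

    bool rebuild_missing(const std::vector<LogEntrySketch>& log_newest_first,
                         const std::string& oid,
                         uint64_t object_info_version,
                         CleanRegionEntry* out) {
      bool missing = false;
      for (const auto& e : log_newest_first) {
        if (e.oid != oid || e.version <= object_info_version)
          continue;  // entry already reflected in the on-disk object
        // Whole-object recovery could stop at the newest entry; partial
        // recovery must keep merging every newer entry's clean regions.
        *out = missing ? merge(*out, e.regions) : e.regions;
        missing = true;
      }
      return missing;  // true -> the object needs (partial) recovery
    }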

Finally, the push and pull process is finished based on the
pg_missing_set. copy_subset in recovery_info is updated based on the
ObjectCleanRegion in the pg_missing_set; copy_subset indicates the
intervals of content that need to be pulled and pushed. The
complicated part here is submit_push_data, where several cases must
be considered separately. What we need to consider is how to deal
with the object data, which is made up of omap_header, xattrs, omap
and data:

case 1 -- first && complete: since object recovery finishes in a
single PushOp, we would like to preserve the original object and
overwrite it directly; the object is not removed and no new one is
touched.
issue 1: as the object is not removed, old xattrs remain on the
object but may have been updated on the source. Overwriting the same
key or adding new keys is correct, but removed keys would wrongly
survive. To solve this issue, we need to remove all original xattrs
on the object and then set the new xattrs.
issue 2: as the object is not removed, the object_map may be
recovered depending on clean_omap. Therefore, if the omap is being
recovered, we need to remove the old omap of the object for the same
reason, since an omap update may also be a deletion.
Thus, in this case, we should do:
1) clear the xattrs of the object
2) clear the omap of the object if omap recovery is needed
3) truncate the object to recovery_info.size
4) recover the omap_header
5) recover the xattrs, and recover the omap if needed
6) punch zeros in the original object where fiemap reports nothing
7) overwrite the object content that was modified
8) finish recovery
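
Issues 1 and 2 both stem from updates that may be deletions; a toy
sketch of why step 1 clears the xattrs first (simplified types, not
Ceph's)::

    #include <map>
    #include <string>

    using XattrSet = std::map<std::string, std::string>;  // toy stand-in

    // Wrong: merging the pushed xattrs onto the preserved object lets
    // keys that were deleted on the source survive locally.
    void apply_xattrs_wrong(XattrSet& on_disk, const XattrSet& pushed) {
      for (const auto& kv : pushed)
        on_disk[kv.first] = kv.second;
    }

    // Right: clear first (step 1), then set (step 5), so deletions land.
    void apply_xattrs_right(XattrSet& on_disk, const XattrSet& pushed) {
      on_disk.clear();
      on_disk = pushed;
    }

The same reasoning covers the omap in step 2 when it is being
recovered.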

case 2 -- first && !complete: object recovery takes multiple PushOps.
Here, target_oid indicates a new temp_object in pgid_TEMP, so the
issues are a bit different.
issue 1: as the object is newly created, there is no need to deal
with xattrs.
issue 2: as the object is newly created, the object_map may not be
transmitted, depending on clean_omap. Therefore, if clean_omap is
true, we need to clone the object_map from the original object.
issue 3: as the object is newly created, unmodified data will not be
transmitted. Therefore, we need to clone the unmodified data from the
original object.
Thus, in this case, we should do:
1) remove the temp object
2) create a new temp object
3) set alloc_hint for the new temp object
4) truncate the new temp object to recovery_info.size
5) recover the omap_header
6) clone the object_map from the original object if the omap is clean
(see the inline review thread below)

@mslovy (Contributor, Author) commented on Sep 19, 2016:
It is awkward to add a new interface clone_omap in ObjectStore.
Furthermore, clone_omap would change the original omap into a parent
and generate two children. Even if the first object is deleted, the
parent and its second child still exist unless the parent's refcount
drops to zero (i.e. the cloned object is removed).
So I would like to retransmit the omap content if recovery cannot be
finished in one PushOp, is that reasonable? @athanatos

Reply (Contributor):
Seems reasonable to me. In the common case, it won't be present if not modified anyway (either it's empty, or it's involved in pretty much every update).

7) clone the unmodified object_data from the original object
8) punch zeros in the new temp object
9) recover the xattrs, and recover the omap if needed
10) overwrite the object content that was modified
11) remove the original object
12) move and rename the new temp object to replace the original object
13) finish recovery
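
A hedged sketch of the overall case-2 shape, with the backend
abstracted behind a template parameter (every call name here is
illustrative rather than Ceph's ObjectStore API; note the review
thread above, where retransmitting the omap was preferred over
clone_omap)::

    #include <cstdint>
    #include <vector>

    struct ExtentSketch { uint64_t off, len; };

    // Sketch only: Store is any backend offering the named operations.
    template <typename Store>
    void finish_partial_push(Store& st, bool omap_clean,
                             const std::vector<ExtentSketch>& clean_extents) {
      if (omap_clean)
        st.clone_omap("orig", "temp");                 // step 6
      for (const auto& e : clean_extents)              // step 7: clean data
        st.clone_range("orig", "temp", e.off, e.len);  //   is not transmitted
      st.remove("orig");                               // step 11
      st.rename("temp", "orig");                       // step 12
    }
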
2 changes: 2 additions & 0 deletions src/include/ceph_features.h
@@ -86,6 +86,7 @@
#define CEPH_FEATURE_FS_FILE_LAYOUT_V2 (1ULL<<58) /* file_layout_t */
#define CEPH_FEATURE_FS_BTIME (1ULL<<59) /* btime */
#define CEPH_FEATURE_FS_CHANGE_ATTR (1ULL<<59) /* change_attr */
#define CEPH_OSD_PARTIAL_RECOVERY (1ULL<<60) /* recover partial extents for objects */

#define CEPH_FEATURE_RESERVED2 (1ULL<<61) /* slow down, we are almost out... */
#define CEPH_FEATURE_RESERVED (1ULL<<62) /* DO NOT USE THIS ... last bit! */
@@ -184,6 +185,7 @@ static inline unsigned long long ceph_sanitize_features(unsigned long long f) {
CEPH_FEATURE_SERVER_KRAKEN | \
CEPH_FEATURE_FS_BTIME | \
CEPH_FEATURE_FS_CHANGE_ATTR | \
CEPH_OSD_PARTIAL_RECOVERY | \
0ULL)

#define CEPH_FEATURES_SUPPORTED_DEFAULT CEPH_FEATURES_ALL
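
The new bit is what feature-aware encoding keys off; a minimal sketch
of the pattern (toy buffer and a hypothetical MissingItemSketch, not
Ceph's bufferlist or pg_missing_t):

    #include <cstdint>
    #include <vector>

    constexpr uint64_t PARTIAL_RECOVERY_BIT = 1ULL << 60;

    // Sketch only: why ::encode(missing, payload, features) below needs
    // the peer's feature bits.
    struct MissingItemSketch {
      uint64_t need = 0, have = 0;
      std::vector<uint8_t> clean_regions_blob;  // new, feature-gated field

      void encode(std::vector<uint8_t>& out, uint64_t features) const {
        append_u64(out, need);
        append_u64(out, have);
        if (features & PARTIAL_RECOVERY_BIT)
          out.insert(out.end(), clean_regions_blob.begin(),
                     clean_regions_blob.end());
        // Legacy peers get the old format and simply treat the whole
        // object as dirty: safe, just less efficient.
      }

      static void append_u64(std::vector<uint8_t>& out, uint64_t v) {
        for (int i = 0; i < 8; ++i)
          out.push_back(uint8_t(v >> (8 * i)));
      }
    };
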
4 changes: 2 additions & 2 deletions src/include/interval_set.h
@@ -234,7 +234,7 @@ class interval_set {
return _size == other._size && m == other.m;
}

int size() const {
uint64_t size() const {
return _size;
}

@@ -541,7 +541,7 @@ class interval_set {

private:
// data
int64_t _size;
uint64_t _size;
std::map<T,T> m; // map start -> len
};

2 changes: 1 addition & 1 deletion src/messages/MOSDPGLog.h
@@ -76,7 +76,7 @@ class MOSDPGLog : public Message {
::encode(epoch, payload);
::encode(info, payload);
::encode(log, payload);
::encode(missing, payload);
::encode(missing, payload, features);
::encode(query_epoch, payload);
::encode(past_intervals, payload);
::encode(to, payload);
2 changes: 1 addition & 1 deletion src/messages/MOSDPGMissing.h
@@ -44,7 +44,7 @@ class MOSDPGMissing : public Message {
void encode_payload(uint64_t features) {
::encode(epoch, payload);
::encode(info, payload);
::encode(missing, payload);
::encode(missing, payload, features);
}
void decode_payload() {
bufferlist::iterator p = payload.begin();
3 changes: 2 additions & 1 deletion src/os/filestore/DBObjectMap.cc
@@ -754,7 +754,8 @@ int DBObjectMap::get(const ghobject_t &oid,
Header header = lookup_map_header(hl, oid);
if (!header)
return -ENOENT;
_get_header(header, _header);
if (_header)
_get_header(header, _header);

Reviewer comment (Contributor):
Separate commit.

ObjectMapIterator iter = _get_iterator(header);
for (iter->seek_to_first(); iter->valid(); iter->next()) {
if (iter->status())
22 changes: 12 additions & 10 deletions src/os/filestore/FileStore.cc
@@ -2667,31 +2667,33 @@ void FileStore::_do_transaction(

case Transaction::OP_CLONERANGE:
{

Reviewer comment (Contributor):
Good catch, add tests for it.

Reply (Contributor, Author):
test is added in 5da53e4

coll_t cid = i.get_cid(op->cid);
coll_t cid, ncid;
cid = ncid = i.get_cid(op->cid);
ghobject_t oid = i.get_oid(op->oid);
_kludge_temp_object_collection(cid, oid);
ghobject_t noid = i.get_oid(op->dest_oid);
_kludge_temp_object_collection(cid, noid);
_kludge_temp_object_collection(ncid, noid);
uint64_t off = op->off;
uint64_t len = op->len;
tracepoint(objectstore, clone_range_enter, osr_name, len);
r = _clone_range(cid, oid, noid, off, len, off, spos);
r = _clone_range(cid, ncid, oid, noid, off, len, off, spos);
tracepoint(objectstore, clone_range_exit, r);
}
break;

case Transaction::OP_CLONERANGE2:
{
coll_t cid = i.get_cid(op->cid);
coll_t cid, ncid;
cid = ncid = i.get_cid(op->cid);
ghobject_t oid = i.get_oid(op->oid);
_kludge_temp_object_collection(cid, oid);
ghobject_t noid = i.get_oid(op->dest_oid);
_kludge_temp_object_collection(cid, noid);
_kludge_temp_object_collection(ncid, noid);
uint64_t srcoff = op->off;
uint64_t len = op->len;
uint64_t dstoff = op->dest_off;
tracepoint(objectstore, clone_range2_enter, osr_name, len);
r = _clone_range(cid, oid, noid, srcoff, len, dstoff, spos);
r = _clone_range(cid, ncid, oid, noid, srcoff, len, dstoff, spos);
tracepoint(objectstore, clone_range2_exit, r);
}
break;
@@ -3709,13 +3711,13 @@ int FileStore::_do_copy_range(int from, int to, uint64_t srcoff, uint64_t len, u
return r;
}

int FileStore::_clone_range(const coll_t& cid, const ghobject_t& oldoid, const ghobject_t& newoid,
int FileStore::_clone_range(const coll_t& cid, const coll_t& ncid, const ghobject_t& oldoid, const ghobject_t& newoid,
uint64_t srcoff, uint64_t len, uint64_t dstoff,
const SequencerPosition& spos)
{
dout(15) << "clone_range " << cid << "/" << oldoid << " -> " << cid << "/" << newoid << " " << srcoff << "~" << len << " to " << dstoff << dendl;

if (_check_replay_guard(cid, newoid, spos) < 0)
if (_check_replay_guard(ncid, newoid, spos) < 0)
return 0;

int r;
@@ -3724,7 +3726,7 @@ int FileStore::_clone_range(const coll_t& cid, const ghobject_t& oldoid, const g
if (r < 0) {
goto out2;
}
r = lfn_open(cid, newoid, true, &n);
r = lfn_open(ncid, newoid, true, &n);
if (r < 0) {
goto out;
}
@@ -3741,7 +3743,7 @@ int FileStore::_clone_range(const coll_t& cid, const ghobject_t& oldoid, const g
out:
lfn_close(o);
out2:
dout(10) << "clone_range " << cid << "/" << oldoid << " -> " << cid << "/" << newoid << " "
dout(10) << "clone_range " << cid << "/" << oldoid << " -> " << ncid << "/" << newoid << " "
<< srcoff << "~" << len << " to " << dstoff << " = " << r << dendl;
return r;
}
2 changes: 1 addition & 1 deletion src/os/filestore/FileStore.h
@@ -557,7 +557,7 @@ class FileStore : public JournalingObjectStore,
int _truncate(const coll_t& cid, const ghobject_t& oid, uint64_t size);
int _clone(const coll_t& cid, const ghobject_t& oldoid, const ghobject_t& newoid,
const SequencerPosition& spos);
int _clone_range(const coll_t& cid, const ghobject_t& oldoid, const ghobject_t& newoid,
int _clone_range(const coll_t& cid, const coll_t& ncid, const ghobject_t& oldoid, const ghobject_t& newoid,
uint64_t srcoff, uint64_t len, uint64_t dstoff,
const SequencerPosition& spos);
int _do_clone_range(int from, int to, uint64_t srcoff, uint64_t len, uint64_t dstoff);
18 changes: 14 additions & 4 deletions src/osd/PGLog.h
@@ -1112,19 +1112,27 @@ struct PGLog : DoutPrefixProvider {
<< "," << info.last_update << "]" << dendl;

set<hobject_t, hobject_t::BitwiseComparator> did;
set<hobject_t, hobject_t::BitwiseComparator> del;
set<hobject_t, hobject_t::BitwiseComparator> checked;
for (list<pg_log_entry_t>::reverse_iterator i = log.log.rbegin();
i != log.log.rend();
++i) {
if (!debug_verify_stored_missing && i->version <= info.last_complete) break;
if (cmp(i->soid, info.last_backfill, info.last_backfill_bitwise) > 0)
continue;
if (i->is_error())
continue;
// merge clean_regions in pg_log_entry_t if needed

Reviewer comment (Contributor):
This logic only comes into play if we are upgrading from the old encoding, right? This doesn't seem worthwhile, just fill out the missing set entry with the worst case.

Reply (Contributor, Author):
@athanatos not absolutely. It is mainly used for the case where the OSD goes down after the new pg_log has been transmitted. In that case, we should find all missing_items based on the local pg_log, because the pg_log has been transmitted but the objects have not been recovered yet.

Reply (Contributor):
I don't think that's true. The recent missing_set changes always write down the missing items right?

Reply (Contributor, Author):
@athanatos, urh, yeah, you are right; current master changes read_log/write_log to read_log_and_missing/write_log_and_missing. So this is used to check whether our missing set is correct as expected, based on the debug_verify_stored_missing flag.

Also, in this way, we need to consider retrieving pg_missing_item from disk, as you mentioned before. I think we can add a tail on the key such that p->key().substr(0, 7) == string("missing") && p->key().substr(7, 9) == string("v1"), then decode the new version, and otherwise decode the old versions? Do you think this is reasonable?

I also agree that it is much easier to deal with the cases here if encode/decode is wrapped by the macro. But, as discussed before, is it worthwhile to change the interface of decode so that we can also pass the feature bits into the decode function?

// there is no need to merge the items before object is deleted
if (!del.count(i->soid))
missing.merge(*i);
if (i->version <= info.last_complete) continue;
if (did.count(i->soid)) continue;
did.insert(i->soid);

if (i->is_delete()) continue;
if (i->is_delete()) {
del.insert(i->soid);
continue;
}

bufferlist bv;
int r = store->getattr(
Expand All @@ -1136,15 +1144,17 @@ struct PGLog : DoutPrefixProvider {
object_info_t oi(bv);
if (oi.version < i->version) {
ldpp_dout(dpp, 15) << "read_log_and_missing missing " << *i
<< " (have " << oi.version << ")" << dendl;
<< " (have " << oi.version << ")"
<< " clean_regions " << i->clean_regions << dendl;
if (debug_verify_stored_missing) {
auto miter = missing.get_items().find(i->soid);
assert(miter != missing.get_items().end());
assert(miter->second.need == i->version);
assert(miter->second.have == oi.version);
checked.insert(i->soid);
} else {
missing.add(i->soid, i->version, oi.version);
missing.add(i->soid, i->version, oi.version, false);
missing.merge(*i);
}
}
} else {