DNM: osd: recovery optimization for overwrite Ops #7325

Closed · wants to merge 18 commits · Changes from 17 commits
121 changes: 121 additions & 0 deletions doc/dev/osd_internals/partial_object_recovery.rst
@@ -0,0 +1,121 @@
=======================
Partial Object Recovery
=======================

Partial Object Recovery aims to improve the efficiency of log-based
recovery rather than backfill. The original log-based recovery
calculates the missing_set based on the difference between pg_logs.
The whole object is recovered from one OSD to another whenever pg_log
indicates that the object was modified, regardless of how much of the
object's content was really modified. That means a 4M object with only
4k modified inside is recovered as the whole 4M object rather than
just the modified 4k of content. In addition, the object map is also
recovered even if it was not modified at all.

Partial Object Recovery is designed to solve the problem mentioned
above. In order to achieve that goal, two things are necessary:
1. logging where the object is modified
2. logging whether the object_map of an object is modified

The class ObjectCleanRegion is introduced for this purpose.
clean_offsets is an interval_set<uint64_t> used to indicate the
unmodified content in an object. clean_omap is a bool indicating
whether the object_map is modified. max_num_intervals is an upper
bound on the number of intervals in clean_offsets, so that the memory
cost of clean_offsets is always bounded. If the number of intervals
in clean_offsets exceeds that bound, the shortest clean interval is
trimmed. E.g. with max_num_intervals=2 and clean_offsets={[5~10],
[20~5]}, a new interval [30~10] evicts the shortest one, [20~5], and
clean_offsets finally becomes {[5~10], [30~10]}.
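
A minimal sketch of the trimming rule, assuming a plain std::map in
place of interval_set (the names mirror the text above; the PR's real
ObjectCleanRegion is richer, with encoding and merge support)::

    #include <cstdint>
    #include <map>

    // Sketch only: models the bounded-interval behaviour described above.
    struct ObjectCleanRegionSketch {
      // start -> length of each clean (unmodified) extent
      std::map<uint64_t, uint64_t> clean_offsets;
      bool clean_omap = true;         // object_map untouched so far
      size_t max_num_intervals = 10;  // bounds the memory cost

      void add_clean(uint64_t off, uint64_t len) {
        clean_offsets[off] = len;     // overlap handling omitted for brevity
        // Trim: evict the shortest clean interval while over the bound.
        while (clean_offsets.size() > max_num_intervals) {
          auto shortest = clean_offsets.begin();
          for (auto it = clean_offsets.begin(); it != clean_offsets.end(); ++it)
            if (it->second < shortest->second)
              shortest = it;
          clean_offsets.erase(shortest);
        }
      }
    };

With max_num_intervals=2 and clean_offsets={[5~10], [20~5]},
add_clean(30, 10) evicts [20~5], leaving {[5~10], [30~10]} as in the
example above.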


Procedures for Partial Object Recovery
=======================================

Firstly, OpContext and pg_log_entry_t should contain an
ObjectCleanRegion. In do_osd_ops(), finish_copyfrom() and
finish_promote(), the corresponding content in the ObjectCleanRegion
is marked dirty, so that the modification of an object is traced. The
ObjectCleanRegion in the OpContext is then carried over into its
pg_log_entry_t.

Secondly, the pg_missing_set must be built and rebuilt correctly.
When calculating the pg_missing_set during the peering process, the
ObjectCleanRegion in each pg_log_entry_t is also merged. E.g. object
aa has pg_log:
26'101 {[0~4096, 8192~MAX], false}
26'104 {[0~8192, 12288~MAX], false}
28'108 {[0~12288, 16384~MAX], true}
missing_set for object aa: merging the pg_log above --> {[0~4096, 16384~MAX], true},
which means 4096~16384 is modified and the object_map is also
modified as of version 28'108.
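
A hedged sketch of that merge rule (the a~b tuples above are
offset~length; the sketch uses half-open [start, end) intervals, and
the bool mirrors the flag printed in the tuples)::

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Sketch only: folding several pg_log entries for one object into a
    // missing_set entry. Intervals are half-open: [start, end).
    struct Interval { uint64_t start, end; };

    // Content is clean only if *every* newer entry left it clean, so the
    // merged clean set is the intersection of the per-entry clean sets.
    std::vector<Interval> intersect(const std::vector<Interval>& a,
                                    const std::vector<Interval>& b) {
      std::vector<Interval> out;
      for (const auto& x : a)
        for (const auto& y : b) {
          uint64_t s = std::max(x.start, y.start);
          uint64_t e = std::min(x.end, y.end);
          if (s < e) out.push_back({s, e});
        }
      return out;
    }

    struct CleanRegionEntry {
      std::vector<Interval> clean;  // clean extents of one log entry
      bool omap_dirty;              // the flag printed in the tuples above
    };

    CleanRegionEntry merge(const CleanRegionEntry& a,
                           const CleanRegionEntry& b) {
      // The omap must be recovered if *any* entry modified it.
      return {intersect(a.clean, b.clean), a.omap_dirty || b.omap_dirty};
    }

Folding the three entries above this way yields clean extents
{[0~4096], [16384~MAX]} and omap_dirty=true, i.e. 4096~16384 plus the
object_map must be recovered.
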
Also, the OSD may crash after merging the log. Therefore, we need to
read_log and rebuild the pg_missing_set. For example, the pg_log is:
object aa: 26'101 {[0~4096, 8192~MAX], false}
object bb: 26'102 {[0~4096, 8192~MAX], false}
object cc: 26'103 {[0~4096, 8192~MAX], false}
object aa: 26'104 {[0~8192, 12288~MAX], false}
object dd: 26'105 {[0~4096, 8192~MAX], false}
object aa: 28'108 {[0~12288, 16384~MAX], true}
Suppose bb, cc and dd have been recovered, but aa has not. We need to
rebuild the pg_missing_set for object aa and find that aa was modified
at version 28'108. Originally, since the version in object_info is
26'96 < 28'108, we would not need to consider 26'104 and 26'101,
because the whole object would be recovered. However, Partial Object
Recovery also requires us to rebuild the ObjectCleanRegion; knowing
whether the object is modified is not enough. Therefore, we also need
to traverse the earlier pg_log entries: 26'104 and 26'101 are also >
object_info (26'96), so the pg_missing_set for object aa is rebuilt
from those three log entries: 28'108, 26'104 and 26'101. The logs are
merged in the same way as described above.
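
A hedged sketch of that rebuild (reusing CleanRegionEntry and merge()
from the sketch above; version is a plain integer standing in for
eversion_t, e.g. 26'96)::

    #include <string>
    #include <vector>

    // Sketch only: rebuild one object's missing entry from the local log.
    struct LogEntrySketch {
      std::string oid;
      uint64_t version;
      CleanRegionEntry regions;
    };

    bool rebuild_missing(const std::vector<LogEntrySketch>& log_newest_first,
                         const std::string& oid,
                         uint64_t object_info_version,
                         CleanRegionEntry* out) {
      bool missing = false;
      for (const auto& e : log_newest_first) {
        if (e.oid != oid || e.version <= object_info_version)
          continue;  // entry already reflected in the on-disk object
        // Whole-object recovery could stop at the newest entry; partial
        // recovery must keep merging every newer entry's clean regions.
        *out = missing ? merge(*out, e.regions) : e.regions;
        missing = true;
      }
      return missing;  // true -> the object needs (partial) recovery
    }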

Finally, the push and pull process is finished based on the
pg_missing_set. copy_subset in recovery_info is updated based on the
ObjectCleanRegion in the pg_missing_set; copy_subset indicates the
intervals of content that need to be pulled and pushed. The
complicated part here is submit_push_data, where several cases must
be considered separately. What we need to consider is how to deal
with the object data, which is made up of omap_header, xattrs, omap
and data:

case 1 -- first && complete: since object recovery finishes in a
single PushOp, we would like to preserve the original object and
overwrite it directly; the object is not removed and no new one is
touched.
issue 1: as the object is not removed, old xattrs remain on the
object but may have been updated on the source. Overwriting the same
key or adding new keys is correct, but removed keys would wrongly
survive. To solve this issue, we need to remove all original xattrs
on the object and then set the new xattrs.
issue 2: as the object is not removed, the object_map may be
recovered depending on clean_omap. Therefore, if the omap is being
recovered, we need to remove the old omap of the object for the same
reason, since an omap update may also be a deletion.
Thus, in this case, we should do:
1) clear the xattrs of the object
2) clear the omap of the object if omap recovery is needed
3) truncate the object to recovery_info.size
4) recover the omap_header
5) recover the xattrs, and recover the omap if needed
6) punch zeros in the original object where fiemap reports nothing
7) overwrite the object content that was modified
8) finish recovery
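
Issues 1 and 2 both stem from updates that may be deletions; a toy
sketch of why step 1 clears the xattrs first (simplified types, not
Ceph's)::

    #include <map>
    #include <string>

    using XattrSet = std::map<std::string, std::string>;  // toy stand-in

    // Wrong: merging the pushed xattrs onto the preserved object lets
    // keys that were deleted on the source survive locally.
    void apply_xattrs_wrong(XattrSet& on_disk, const XattrSet& pushed) {
      for (const auto& kv : pushed)
        on_disk[kv.first] = kv.second;
    }

    // Right: clear first (step 1), then set (step 5), so deletions land.
    void apply_xattrs_right(XattrSet& on_disk, const XattrSet& pushed) {
      on_disk.clear();
      on_disk = pushed;
    }

The same reasoning covers the omap in step 2 when it is being
recovered.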

case 2 -- first && !complete: object recovery takes multiple PushOps.
Here, target_oid indicates a new temp_object in pgid_TEMP, so the
issues are a bit different.
issue 1: as the object is newly created, there is no need to deal
with xattrs.
issue 2: as the object is newly created, the object_map may not be
transmitted, depending on clean_omap. Therefore, if clean_omap is
true, we need to clone the object_map from the original object.
issue 3: as the object is newly created, unmodified data will not be
transmitted. Therefore, we need to clone the unmodified data from the
original object.
Thus, in this case, we should do:
1) remove the temp object
2) create a new temp object
3) set alloc_hint for the new temp object
4) truncate the new temp object to recovery_info.size
5) recover the omap_header
6) clone the object_map from the original object if the omap is clean
(see the inline review thread below)

@mslovy (Contributor, Author) commented on Sep 19, 2016:
It is awkward to add a new interface clone_omap in ObjectStore.
Furthermore, clone_omap would change the original omap into a parent
and generate two children. Even if the first object is deleted, the
parent and its second child still exist unless the parent's refcount
drops to zero (i.e. the cloned object is removed).
So I would like to retransmit the omap content if recovery cannot be
finished in one PushOp, is that reasonable? @athanatos

Reply (Contributor):
Seems reasonable to me. In the common case, it won't be present if not modified anyway (either it's empty, or it's involved in pretty much every update).

7) clone the unmodified object_data from the original object
8) punch zeros in the new temp object
9) recover the xattrs, and recover the omap if needed
10) overwrite the object content that was modified
11) remove the original object
12) move and rename the new temp object to replace the original object
13) finish recovery
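
A hedged sketch of the overall case-2 shape, with the backend
abstracted behind a template parameter (every call name here is
illustrative rather than Ceph's ObjectStore API; note the review
thread above, where retransmitting the omap was preferred over
clone_omap)::

    #include <cstdint>
    #include <vector>

    struct ExtentSketch { uint64_t off, len; };

    // Sketch only: Store is any backend offering the named operations.
    template <typename Store>
    void finish_partial_push(Store& st, bool omap_clean,
                             const std::vector<ExtentSketch>& clean_extents) {
      if (omap_clean)
        st.clone_omap("orig", "temp");                 // step 6
      for (const auto& e : clean_extents)              // step 7: clean data
        st.clone_range("orig", "temp", e.off, e.len);  //   is not transmitted
      st.remove("orig");                               // step 11
      st.rename("temp", "orig");                       // step 12
    }
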
2 changes: 2 additions & 0 deletions src/include/ceph_features.h
@@ -86,6 +86,7 @@
#define CEPH_FEATURE_FS_FILE_LAYOUT_V2 (1ULL<<58) /* file_layout_t */
#define CEPH_FEATURE_FS_BTIME (1ULL<<59) /* btime */
#define CEPH_FEATURE_FS_CHANGE_ATTR (1ULL<<59) /* change_attr */
#define CEPH_OSD_PARTIAL_RECOVERY (1ULL<<60) /* recover partial extents for objects */

#define CEPH_FEATURE_RESERVED2 (1ULL<<61) /* slow down, we are almost out... */
#define CEPH_FEATURE_RESERVED (1ULL<<62) /* DO NOT USE THIS ... last bit! */
@@ -184,6 +185,7 @@ static inline unsigned long long ceph_sanitize_features(unsigned long long f) {
CEPH_FEATURE_SERVER_KRAKEN | \
CEPH_FEATURE_FS_BTIME | \
CEPH_FEATURE_FS_CHANGE_ATTR | \
CEPH_OSD_PARTIAL_RECOVERY | \
0ULL)

#define CEPH_FEATURES_SUPPORTED_DEFAULT CEPH_FEATURES_ALL
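
The new bit is what feature-aware encoding keys off; a minimal sketch
of the pattern (toy buffer and a hypothetical MissingItemSketch, not
Ceph's bufferlist or pg_missing_t):

    #include <cstdint>
    #include <vector>

    constexpr uint64_t PARTIAL_RECOVERY_BIT = 1ULL << 60;

    // Sketch only: why ::encode(missing, payload, features) below needs
    // the peer's feature bits.
    struct MissingItemSketch {
      uint64_t need = 0, have = 0;
      std::vector<uint8_t> clean_regions_blob;  // new, feature-gated field

      void encode(std::vector<uint8_t>& out, uint64_t features) const {
        append_u64(out, need);
        append_u64(out, have);
        if (features & PARTIAL_RECOVERY_BIT)
          out.insert(out.end(), clean_regions_blob.begin(),
                     clean_regions_blob.end());
        // Legacy peers get the old format and simply treat the whole
        // object as dirty: safe, just less efficient.
      }

      static void append_u64(std::vector<uint8_t>& out, uint64_t v) {
        for (int i = 0; i < 8; ++i)
          out.push_back(uint8_t(v >> (8 * i)));
      }
    };
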
4 changes: 2 additions & 2 deletions src/include/interval_set.h
@@ -234,7 +234,7 @@ class interval_set {
return _size == other._size && m == other.m;
}

int size() const {
uint64_t size() const {
return _size;
}

@@ -541,7 +541,7 @@ class interval_set {

private:
// data
int64_t _size;
uint64_t _size;
std::map<T,T> m; // map start -> len
};

2 changes: 1 addition & 1 deletion src/messages/MOSDPGLog.h
@@ -76,7 +76,7 @@ class MOSDPGLog : public Message {
::encode(epoch, payload);
::encode(info, payload);
::encode(log, payload);
::encode(missing, payload);
::encode(missing, payload, features);
::encode(query_epoch, payload);
::encode(past_intervals, payload);
::encode(to, payload);
2 changes: 1 addition & 1 deletion src/messages/MOSDPGMissing.h
@@ -44,7 +44,7 @@ class MOSDPGMissing : public Message {
void encode_payload(uint64_t features) {
::encode(epoch, payload);
::encode(info, payload);
::encode(missing, payload);
::encode(missing, payload, features);
}
void decode_payload() {
bufferlist::iterator p = payload.begin();
3 changes: 2 additions & 1 deletion src/os/filestore/DBObjectMap.cc
@@ -754,7 +754,8 @@ int DBObjectMap::get(const ghobject_t &oid,
Header header = lookup_map_header(hl, oid);
if (!header)
return -ENOENT;
_get_header(header, _header);
if (_header)
_get_header(header, _header);

Reviewer comment (Contributor):
Separate commit.

ObjectMapIterator iter = _get_iterator(header);
for (iter->seek_to_first(); iter->valid(); iter->next()) {
if (iter->status())
22 changes: 12 additions & 10 deletions src/os/filestore/FileStore.cc
@@ -2667,31 +2667,33 @@ void FileStore::_do_transaction(

case Transaction::OP_CLONERANGE:
{

Reviewer comment (Contributor):
Good catch, add tests for it.

Reply (Contributor, Author):
test is added in 5da53e4

coll_t cid = i.get_cid(op->cid);
coll_t cid, ncid;
cid = ncid = i.get_cid(op->cid);
ghobject_t oid = i.get_oid(op->oid);
_kludge_temp_object_collection(cid, oid);
ghobject_t noid = i.get_oid(op->dest_oid);
_kludge_temp_object_collection(cid, noid);
_kludge_temp_object_collection(ncid, noid);
uint64_t off = op->off;
uint64_t len = op->len;
tracepoint(objectstore, clone_range_enter, osr_name, len);
r = _clone_range(cid, oid, noid, off, len, off, spos);
r = _clone_range(cid, ncid, oid, noid, off, len, off, spos);
tracepoint(objectstore, clone_range_exit, r);
}
break;

case Transaction::OP_CLONERANGE2:
{
coll_t cid = i.get_cid(op->cid);
coll_t cid, ncid;
cid = ncid = i.get_cid(op->cid);
ghobject_t oid = i.get_oid(op->oid);
_kludge_temp_object_collection(cid, oid);
ghobject_t noid = i.get_oid(op->dest_oid);
_kludge_temp_object_collection(cid, noid);
_kludge_temp_object_collection(ncid, noid);
uint64_t srcoff = op->off;
uint64_t len = op->len;
uint64_t dstoff = op->dest_off;
tracepoint(objectstore, clone_range2_enter, osr_name, len);
r = _clone_range(cid, oid, noid, srcoff, len, dstoff, spos);
r = _clone_range(cid, ncid, oid, noid, srcoff, len, dstoff, spos);
tracepoint(objectstore, clone_range2_exit, r);
}
break;
@@ -3709,13 +3711,13 @@ int FileStore::_do_copy_range(int from, int to, uint64_t srcoff, uint64_t len, u
return r;
}

int FileStore::_clone_range(const coll_t& cid, const ghobject_t& oldoid, const ghobject_t& newoid,
int FileStore::_clone_range(const coll_t& cid, const coll_t& ncid, const ghobject_t& oldoid, const ghobject_t& newoid,
uint64_t srcoff, uint64_t len, uint64_t dstoff,
const SequencerPosition& spos)
{
dout(15) << "clone_range " << cid << "/" << oldoid << " -> " << cid << "/" << newoid << " " << srcoff << "~" << len << " to " << dstoff << dendl;

if (_check_replay_guard(cid, newoid, spos) < 0)
if (_check_replay_guard(ncid, newoid, spos) < 0)
return 0;

int r;
@@ -3724,7 +3726,7 @@ int FileStore::_clone_range(const coll_t& cid, const ghobject_t& oldoid, const g
if (r < 0) {
goto out2;
}
r = lfn_open(cid, newoid, true, &n);
r = lfn_open(ncid, newoid, true, &n);
if (r < 0) {
goto out;
}
@@ -3741,7 +3743,7 @@ int FileStore::_clone_range(const coll_t& cid, const ghobject_t& oldoid, const g
out:
lfn_close(o);
out2:
dout(10) << "clone_range " << cid << "/" << oldoid << " -> " << cid << "/" << newoid << " "
dout(10) << "clone_range " << cid << "/" << oldoid << " -> " << ncid << "/" << newoid << " "
<< srcoff << "~" << len << " to " << dstoff << " = " << r << dendl;
return r;
}
2 changes: 1 addition & 1 deletion src/os/filestore/FileStore.h
@@ -557,7 +557,7 @@ class FileStore : public JournalingObjectStore,
int _truncate(const coll_t& cid, const ghobject_t& oid, uint64_t size);
int _clone(const coll_t& cid, const ghobject_t& oldoid, const ghobject_t& newoid,
const SequencerPosition& spos);
int _clone_range(const coll_t& cid, const ghobject_t& oldoid, const ghobject_t& newoid,
int _clone_range(const coll_t& cid, const coll_t& ncid, const ghobject_t& oldoid, const ghobject_t& newoid,
uint64_t srcoff, uint64_t len, uint64_t dstoff,
const SequencerPosition& spos);
int _do_clone_range(int from, int to, uint64_t srcoff, uint64_t len, uint64_t dstoff);
18 changes: 14 additions & 4 deletions src/osd/PGLog.h
@@ -1112,19 +1112,27 @@ struct PGLog : DoutPrefixProvider {
<< "," << info.last_update << "]" << dendl;

set<hobject_t, hobject_t::BitwiseComparator> did;
set<hobject_t, hobject_t::BitwiseComparator> del;
set<hobject_t, hobject_t::BitwiseComparator> checked;
for (list<pg_log_entry_t>::reverse_iterator i = log.log.rbegin();
i != log.log.rend();
++i) {
if (!debug_verify_stored_missing && i->version <= info.last_complete) break;
if (cmp(i->soid, info.last_backfill, info.last_backfill_bitwise) > 0)
continue;
if (i->is_error())
continue;
// merge clean_regions in pg_log_entry_t if needed

Reviewer comment (Contributor):
This logic only comes into play if we are upgrading from the old encoding, right? This doesn't seem worthwhile, just fill out the missing set entry with the worst case.

Reply (Contributor, Author):
@athanatos not absolutely. It is mainly used for the case where the OSD goes down after the new pg_log has been transmitted. In that case, we should find all missing_items based on the local pg_log, because the pg_log has been transmitted but the objects have not been recovered yet.

Reply (Contributor):
I don't think that's true. The recent missing_set changes always write down the missing items right?

Reply (Contributor, Author):
@athanatos, urh, yeah, you are right; current master changes read_log/write_log to read_log_and_missing/write_log_and_missing. So this is used to check whether our missing set is correct as expected, based on the debug_verify_stored_missing flag.

Also, in this way, we need to consider retrieving pg_missing_item from disk, as you mentioned before. I think we can add a tail on the key such that p->key().substr(0, 7) == string("missing") && p->key().substr(7, 9) == string("v1"), then decode the new version, and otherwise decode the old versions? Do you think this is reasonable?

I also agree that it is much easier to deal with the cases here if encode/decode is wrapped by the macro. But, as discussed before, is it worthwhile to change the interface of decode so that we can also pass the feature bits into the decode function?

// there is no need to merge the items before object is deleted
if (!del.count(i->soid))
missing.merge(*i);
if (i->version <= info.last_complete) continue;
if (did.count(i->soid)) continue;
did.insert(i->soid);

if (i->is_delete()) continue;
if (i->is_delete()) {
del.insert(i->soid);
continue;
}

bufferlist bv;
int r = store->getattr(
Expand All @@ -1136,15 +1144,17 @@ struct PGLog : DoutPrefixProvider {
object_info_t oi(bv);
if (oi.version < i->version) {
ldpp_dout(dpp, 15) << "read_log_and_missing missing " << *i
<< " (have " << oi.version << ")" << dendl;
<< " (have " << oi.version << ")"
<< " clean_regions " << i->clean_regions << dendl;
if (debug_verify_stored_missing) {
auto miter = missing.get_items().find(i->soid);
assert(miter != missing.get_items().end());
assert(miter->second.need == i->version);
assert(miter->second.have == oi.version);
checked.insert(i->soid);
} else {
missing.add(i->soid, i->version, oi.version);
missing.add(i->soid, i->version, oi.version, false);
missing.merge(*i);
}
}
} else {