
osd,librados: add manifest, operations for chunked object #15482

Merged
merged 9 commits on Nov 30, 2017

Conversation

myoungwon
Member

As discussed with @liewegas, these commits are the second stage (chunked manifest) of deduplication
(http://pad.ceph.com/p/deduplication_how_dedup_manifists,
http://pad.ceph.com/p/deduplication_how_do_we_store_chunk)

Signed-off-by: Myoungwon Oh omwmw@sk.com

@myoungwon
Member Author

@liewegas Could you review these commits?
They pass the following test:

rados -p rbd set-chunk chunk_test(object name) --target-pool sds-hot 131072
rados -p rbd put chunk_test ./test1
rados -p rbd get chunk_test ./test2
diff test1 test2

But some operations, such as copy_from, are difficult to proxy (read/write) for a chunked object. Therefore, I think the writeback (dedup) process is needed to support such operations: the base pool absorbs all writes, and this process converts the data to chunked data. (http://pad.ceph.com/p/deduplication_how_to_drive_dedup_process)

What do you think? Do we need to support all of the operations at this stage?

oi.manifest.chunk_map[cursor] = chunk_info;
}

oi.set_flag(object_info_t::FLAG_MANIFEST);
Member

I wonder if we should have something like

if (!oi.manifest.is_chunked()) {
  oi.manifest.clear();
}

so that the redirect target is cleared if it was a redirect (or any other future stuff that goes into the manifest is cleaned up).

Member Author

Fixed.

<< " req_length: " << req_length << dendl;

osd_reqid_t reqid = m->get_reqid();
reqid.inc+=req_offset;
Member

I don't think this will work. inc == incarnation, and it's there because the mds entity_name_t is something like mds.0, mds.1, and it's re-used after failover (with inc changing). We could change that so that the requests come from mds.$gid (which is unique) instead of mds.$rank, and then repurpose the inc field for this... that is probably my vote, actually. @jcsp does that work for you?

Member Author

@myoungwon myoungwon Jun 6, 2017

@liewegas Right. Anyway, we need to create a unique osd_reqid_t in order to avoid the dup-op check in do_op().
I will investigate this.

Member

I suggest using inc = original reqid.inc + chunk_index + 1 and we'll resolve the mds use of inc separately
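A minimal sketch of that derivation (the helper name is hypothetical; in the PR this would happen inline where the per-chunk sub-ops are built):

// Hypothetical helper: derive a unique reqid for each chunk sub-op so that
// the dup-op check in do_op() does not collapse the sub-ops into one request.
osd_reqid_t make_chunk_reqid(const osd_reqid_t& orig, uint64_t chunk_index)
{
  osd_reqid_t r = orig;
  r.inc = orig.inc + chunk_index + 1;  // original inc + chunk_index + 1
  return r;
}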

Member Author

Thanks for your suggestion. Fixed.

SnapContext snapc(m->get_snap_seq(), m->get_snaps());
object_manifest_t *manifest = &obc->obs.oi.manifest;
uint64_t chunk_length = manifest->chunk_map[0].length;
uint64_t chunk_index = req_offset - (req_offset % chunk_length);
Member

This is baking in the assumption that chunks are fixed-size. Since chunk_map is actually a map, can't we just iterate over the map and send a request for each chunk, with no assumptions about the chunk sizes?
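A minimal sketch of that iteration (send_chunk_read is a hypothetical helper standing in for the proxied sub-op machinery):

// Walk the (offset -> chunk_info_t) map; chunk sizes may vary per entry.
for (auto& p : manifest->chunk_map) {
  uint64_t chunk_off = p.first;
  uint64_t chunk_len = p.second.length;
  // skip chunks that do not overlap [req_offset, req_offset + req_length)
  if (chunk_off + chunk_len <= req_offset ||
      chunk_off >= req_offset + req_length)
    continue;
  send_chunk_read(p.second.oid, chunk_off, chunk_len);  // hypothetical helper
}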

Member Author

Right. Fixed.

chunk_name = chunk_name + "_" + to_string(chunk_index);
manifest->chunk_map[chunk_index] = manifest->chunk_map[0];
manifest->chunk_map[chunk_index].oid.oid = object_t(chunk_name);
manifest->chunk_map[chunk_index].flags = chunk_info_t::CHUNK_CLEAN;
Member

This is baking in a specific chunking strategy for arbitrary writes. I'm not sure we should do that yet until we have some idea how to describe that policy?

I wonder if, instead, we want some flag to indicate that this particular chunk is local and 'dirty'. Which makes me question what the CLEAN and DIRTY flags really mean: if a particular chunk is DIRTY, that means the local copy is newer; in which case, does it even make sense to say that there is a remote copy that is stale? (Maybe for deferred cleanup of the ref or something?) If not, we really have two types of chunks, more like LOCAL and REMOTE?

Member

Either way, for a minimal next step, maybe we simply implement the chunked read and ignore the chunked write case? A simplistic approach would be for a write to a chunked object to trigger a promote and then a local write.
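A minimal sketch of that fallback, assuming the surrounding maybe_handle_manifest_detail() context (the promote_object() arguments are illustrative):

// Writes to a chunked object: promote first, then let the write run locally.
if (write_ordered) {
  promote_object(obc, obc->obs.oi.soid, oloc, op, NULL);
  return cache_result_t::BLOCKED_PROMOTE;  // op is requeued after the promote
}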

case CEPH_OSD_OP_CHECKSUM:
op.flags = (op.flags | CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL) &
~(CEPH_OSD_OP_FLAG_FADVISE_DONTNEED | CEPH_OSD_OP_FLAG_FADVISE_NOCACHE);
}
Member

It seems like we also want some sort of can_proxy_chunked_read() helper that looks to see if the vector is composed solely of operations that can be translated across chunks (attrs ops and extent ops, but not cls or omap stuff). If we !can_proxy_chunked_read() then we need to trigger a promote.
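A minimal sketch of such a helper, assuming it walks the op vector and whitelists only ops that translate across chunks (the exact op set shown is illustrative):

bool can_proxy_chunked_read(OpRequestRef op)
{
  const MOSDOp *m = static_cast<const MOSDOp*>(op->get_req());
  for (const auto& osd_op : m->ops) {
    switch (osd_op.op.op) {
    case CEPH_OSD_OP_READ:         // extent ops translate across chunks
    case CEPH_OSD_OP_SPARSE_READ:
    case CEPH_OSD_OP_STAT:
    case CEPH_OSD_OP_GETXATTR:     // attr ops are chunk-agnostic
      break;
    default:
      return false;                // cls, omap, etc.: must promote instead
    }
  }
  return true;
}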

Member

Once we consider the promote case, we need to figure out how to trigger the ref-counting operations that happen when we clear a manifest. This will require a bit of thinking. Maybe pool properties indicating that references to objects in that pool need to be sent a refcount.release op (or whatever)? That would also apply to a simple redirect manifest if the target pool has that property, I would imagine?

@liewegas
Member

liewegas commented Jun 5, 2017

My suggestion for now is to

  1. update the manifest types (as you do)
  2. focus just on proxy read for chunked ops, with helper
  3. put in the fallback to a promote

and then let's sort out

  1. how to define the chunk types (more clearly CLEAN, DIRTY, or consider alternatives)
  2. how to handle ref cleanup (e.g., when we promote or when we delete the object with the manifest)
  3. if/how to handle a chunked object where some chunks are clean and others are stored on the local object.

@myoungwon myoungwon force-pushed the wip-chunked-manifest branch 3 times, most recently from 956ee20 to 13a5504 on June 6, 2017 at 16:29
@myoungwon
Member Author

@liewegas I updated the source code as you suggested. If you agree with the code, I will sort out the chunk types, ref cleanup, and handling of chunked objects as you mentioned.

@jcsp
Contributor

jcsp commented Jun 6, 2017

@liewegas @myoungwon yes, making the MDS use its GID when talking to the OSDs is 👍. Because the cephfs clients still expect to talk to the mds by rank, it means splitting the messengers out.

I have a branch from a while back: https://github.com/jcsp/ceph/tree/wip-15399-twomsg -- I can't remember what state it's in or if it even compiles.

The other reason we need this is to enable two MDS daemons from different filesystems (which may have the same rank) to use the same pool, so it's a real bonus if we can make this change.

redirect_target.dump(f);
f->close_section();
}
{
Member

let's do something more strongly typed, like

if (type == TYPE_CHUNKED) {
  f->open_array_section("chunk_map");
  for (auto& p : chunk_map) {
    f->open_object_section("chunk");
    f->dump_unsigned("offset", p.first);
    p.second.dump(f);
    f->close_section();
  }
  f->close_section();
}

struct chunk_info_t {
enum {
CHUNK_CLEAN = 0,
CHUNK_DIRTY = 1,
Member

if these are flags, we just need FLAG_DIRTY = 1, and the lack of that flag means clean

};
uint64_t length;
hobject_t oid;
uint64_t flags; // dirty, etc.
Member

// FLAG_*

case CHUNK_CLEAN: return "clean";
case CHUNK_DIRTY: return "dirty";
default: return "unknown";
}
Member

string r;
if (flags & FLAG_DIRTY) {
  if (!r.empty())
    r += "|";
  r += "dirty";
}

@@ -4974,7 +5028,13 @@ void object_manifest_t::generate_test_instances(list<object_manifest_t*>& o)

ostream& operator<<(ostream& out, const object_manifest_t& om)
{
- return out << "type:" << om.type << " redirect_target:" << om.redirect_target;
+ out << "type:" << om.type << " redirect_target:" << om.redirect_target;
Member

out << "manifest(" << om.get_type_name();
if (om.is_redirect()) {
  out << " " << om.redirect_target;
}
else if (om.is_chunked()) {
  out << " " << om.chunk_map;
}
out << ")";
}

ostream& operator<<(ostream& out, const chunk_info_t& ci)
{
return out << "length: " << ci.length << " oid: " << ci.oid
<< " flags: " << ci.get_flag_string(ci.flags);
Member

return out << "(len: " << .... << ")";

@liewegas
Member

liewegas commented Jun 6, 2017 via email

@myoungwon
Member Author

myoungwon commented Jun 8, 2017

@liewegas

  1. how to define the chunk types (more clearly CLEAN, DIRTY, or consider alternatives)
  • I think we need to define at least three types:
    a. CLEAN: the local data (base pool) and the remote data (cas pool) are equal
    b. DIRTY: the local data differs from the remote data (the local data has been modified but not yet flushed)
    c. HAS_FINGERPRINT: the fingerprint has been generated

  • For example (deduplication):
    a. CLEAN --> (if the write is a full write, we can generate the fingerprint) DIRTY & HAS_FINGERPRINT --> flush --> CLEAN
    b. CLEAN --> (if the write is a partial write) DIRTY --> (read the full data, then generate the fingerprint) DIRTY & HAS_FINGERPRINT --> flush --> CLEAN

  2. how to handle ref cleanup (e.g., when we promote or when we delete the object with the manifest)
  • Your suggestion (if the pool property is set, a release op needs to be sent) makes sense.
    Do we need to define a new op (CEPH_OSD_OP_REFERENCE) and keep a reference count in an xattr or in object_info_t, as in the examples below? (See also the sketch after this list.)
    SetRedirect or SetChunk --> send the new op to the remote object (inc refcount)
    Unset* --> send the new op to the remote object (dec refcount; if refcount == 0, the data will be deleted)
    Or should we just send a delete to the target object when we clear a manifest?
  3. if/how to handle a chunked object where some chunks are clean and others are stored on the local object
  • My suggestion is a background thread. We will eventually need a simple offline agent for deduplication; this process can find dirty chunks and flush them to the remote pool.
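A minimal sketch of the refcount idea from point 2 (CEPH_OSD_OP_REFERENCE, the refcount xattr, and all helpers here are hypothetical):

// Hypothetical handler for a CEPH_OSD_OP_REFERENCE op on the target object:
// inc=true for SetRedirect/SetChunk, inc=false for Unset*.
void handle_reference_op(ObjectContextRef obc, bool inc)
{
  uint64_t cnt = get_refcount_xattr(obc);  // hypothetical: read "refcount" xattr
  cnt = inc ? cnt + 1 : cnt - 1;
  if (!inc && cnt == 0)
    delete_target_object(obc);             // hypothetical: last reference dropped
  else
    set_refcount_xattr(obc, cnt);          // hypothetical: persist the new count
}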

@myoungwon
Member Author

myoungwon commented Jun 9, 2017

@liewegas If a background thread is the right way to flush dirty objects, I think the existing tier agent can be reused, because it already does similar jobs. Do we need a separate thread?

@liewegas
Member

liewegas commented Jun 9, 2017 via email

@myoungwon
Member Author

@liewegas If so, the simplest way is a "flush" op for manifests (a new rados op). What do you think?

@myoungwon
Member Author

@liewegas Can you give me some feedback? (about how to handle dirty chunks and ref cleanup)

if (can_proxy_chunked_read(op)) {
do_proxy_chunked_op(op, obc->obs.oi.soid, obc, write_ordered);
return cache_result_t::HANDLED_PROXY;
}
Member

I think the can_proxy block should go to the top... and if we can't proxy, we must promote. Right?

Member Author

can_proxy_chunked_read() just checks whether ops (such as CEPH_OSD_OP_READ...) can be handled or not. If you look at the code above, you can see if (oi.size == 0) { promote_object() }.
oi.size == 0 indicates that the object needs to be promoted (== can't proxy).

Member

The order of these should at least be reversed, right?

If we can proxy the read, do that. Otherwise, if we don't have the object locally, promote.

Member Author

@liewegas Yes, this is reversed. In the previous code, proxy_chunked_read() needed oi.size to pre-allocate a bufferlist for the read, so promote_object() had to run first. I fixed this as you suggested (the object size is now set from the chunk_map in the manifest).
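A minimal sketch of the resulting order in maybe_handle_manifest_detail() (the calls follow the PR snippets above; the promote_object() arguments are illustrative):

case object_manifest_t::TYPE_CHUNKED:
  // try to proxy first; only fall back to a promote if we cannot
  if (can_proxy_chunked_read(op)) {
    do_proxy_chunked_op(op, obc->obs.oi.soid, obc, write_ordered);
    return cache_result_t::HANDLED_PROXY;
  }
  // can't proxy: promote the object and block the op until that completes
  promote_object(obc, obc->obs.oi.soid, oloc, op, NULL);
  return cache_result_t::BLOCKED_PROMOTE;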

@@ -2344,6 +2344,17 @@ PrimaryLogPG::cache_result_t PrimaryLogPG::maybe_handle_manifest_detail(
}
return cache_result_t::HANDLED_PROXY;
case object_manifest_t::TYPE_CHUNKED:
if (obc->obs.oi.size == 0) {
Member

I think this should be if size > 0?

Member Author

size == 0 means that this object has been flushed (only the remote copy exists), so we need to promote.

Member

I see

@liewegas
Member

a. CLEAN: the local data (base pool) and the remote data (cas pool) are equal

I would expect that normally, after we flush, we would zero or truncate off the local data. But on promote, we might copy the data back locally and not want to delete/dereference the chunk (yet) unless/until we overwrite it or something. Which means I think there are more states:

  • CLEAN: local copy and remote copy match
  • DIRTY: local copy is up to date, remote copy is stale (or does not exist, depending on whether the remote oid is hobject_t() or an actual object)
  • MISSING: no local copy, only remote copy

I'm not sure the HAS_FINGERPRINT state is necessary. Is it worth writing down the fingerprint before we actually write out the chunk? It seems like that can be in-memory state until the remote object is written/referenced, and then we record it. After a crash we might recalculate the hash, but that is a small price to pay for less complexity (and less IO to write down the intermediate state).
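A minimal sketch of those states as chunk_info_t flags (values illustrative; the final enum may differ):

struct chunk_info_t {
  enum {
    FLAG_DIRTY   = 1,  // local copy up to date; remote copy stale or absent
    FLAG_MISSING = 2,  // no local copy; only the remote copy exists
    // neither flag set == CLEAN: local and remote copies match
  };
  uint64_t length = 0;
  hobject_t oid;
  uint64_t flags = 0;  // FLAG_*
};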

So for dedup, the sequence would be

  • no manifest (object is local)
  • we decide there are 2 chunks
  • we write the first chunk, new manifest written with a CLEAN and DIRTY chunk (or the CLEAN chunk is MISSING and we zero out that range of the object)
  • we write the second chunk, local object truncated to 0, new manifest written with two MISSING chunks

The third step could also skip writing down the manifest and just wait for all chunks to flush. It makes a broader window for a crash that leaves a dangling ref, but generates less metadata IO.

Signed-off-by: Myoungwon Oh <omwmw@sk.com>
Signed-off-by: Myoungwon Oh <omwmw@sk.com>
Signed-off-by: Myoungwon Oh <omwmw@sk.com>
@myoungwon
Member Author

@liewegas ping?

@liewegas
Member

@myoungwon sorry for the slow response! Just spoke with @jdurgin about this. The concern has been that adding more tiering code to the OSD may make the performance refactoring more difficult. We think we should go ahead with this work, though, with the understanding that it is new/experimental, and if it causes problems down the line we may have to disable it until it can be fixed. Hopefully that won't happen, but it would not be surprising if much of the proxied read/write code has to be reworked.

Running this through the qa suite one more time before merge!

@myoungwon
Member Author

I agree with you. An experimental feature needs to be disabled until it is stable.

@liewegas
Member

Bunch of failures on http://pulpito.ceph.com/sage-2017-11-17_20:41:49-rados-wip-sage-testing-2017-11-16-1438-distro-basic-smithi/; removing this PR to see if that is the problem

This commit prevents promote_object() if the object needs to be blocked

Signed-off-by: Myoungwon Oh <omwmw@sk.com>
@myoungwon
Member Author

The failures occurred for the following reason:

  1. An object is degraded (has missing)
  2. The OSD receives a read op for a chunked object
  3. The chunked object has a missing chunk, so promote_object is invoked to process the read op
  4. promote_object causes write operations
    (it seems that this causes an unwanted missing state for the object)

Therefore, I pushed a commit that prevents promote_object when the object is degraded.
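A minimal sketch of that guard, assuming PrimaryLogPG's existing degraded-object helpers (the exact placement and the result value are illustrative):

// Before promoting, requeue the op if the object is degraded; promoting
// would issue writes against a degraded object.
if (is_degraded_or_backfilling_object(obc->obs.oi.soid)) {
  wait_for_degraded_object(obc->obs.oi.soid, op);
  return cache_result_t::BLOCKED_RECOVERY;  // hypothetical result value
}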

@myoungwon myoungwon changed the title WIP osd,librados: add manifest, operations for chunked object osd,librados: add manifest, operations for chunked object Nov 21, 2017
reference leak occurs if sub_cop has ObjectContextRef

Signed-off-by: Myoungwon Oh <omwmw@sk.com>
@myoungwon
Member Author

retest this please

@myoungwon
Member Author

@liewegas I fixed the two error cases above. Ready to retest (http://pulpito.ceph.com/myoungwon-2017-11-23_02:21:50-rados:thrash-wip-chunked-manifest-distro-basic-smithi/). The error was caused by the reference leak. Sorry for my earlier misunderstanding.

@myoungwon
Member Author

@liewegas I think test results look good. Can you take a look?
