osd,librados: add manifest, operations for chunked object #15482
Conversation
@liewegas Could you review these commits?

rados -p rbd set-chunk chunk_test(object name) --target-pool sds-hot 131072

But some operations such as copy_from are difficult to proxy read/write for a chunked object. Therefore, I think a writeback (dedup) process is needed to support such operations: the base pool absorbs all writes, and this process converts the data to chunked data (http://pad.ceph.com/p/deduplication_how_to_drive_dedup_process). What do you think? Do we need to support all of the operations at this stage?
src/osd/PrimaryLogPG.cc
Outdated
oi.manifest.chunk_map[cursor] = chunk_info;
}

oi.set_flag(object_info_t::FLAG_MANIFEST);
I wonder if we should have something like
if (!oi.manifest.is_chunked()) { oi.manifest.clear(); }
so that the redirect target is cleared if it was a redirect (or any other future stuff that goes into the manifest is cleaned up).
Fixed.
src/osd/PrimaryLogPG.cc
Outdated
<< " req_length: " << req_length << dendl; | ||
|
||
osd_reqid_t reqid = m->get_reqid(); | ||
reqid.inc+=req_offset; |
I don't think this will work. inc == incarnation, and it's there because the mds entity_name_t is something like mds.0, mds.1, and it's re-used after failover (with inc changing). We could change that so that the requests come from mds.$gid (which is unique) instead of mds.$rank, and then repurpose the inc field for this... that is probably my vote, actually. @jcsp does that work for you?
@liewegas Right. In any case, we need to create a unique osd_reqid_t so that the sub-requests are not rejected as duplicate ops in do_op().
I will investigate this.
I suggest using inc = original reqid.inc + chunk_index + 1, and we'll resolve the mds use of inc separately.
Thanks for your suggestion. Fixed.
src/osd/PrimaryLogPG.cc
Outdated
SnapContext snapc(m->get_snap_seq(), m->get_snaps());
object_manifest_t *manifest = &obc->obs.oi.manifest;
uint64_t chunk_length = manifest->chunk_map[0].length;
uint64_t chunk_index = req_offset - (req_offset % chunk_length);
This is baking in the assumption that chunks are fixed-size. Since chunk_map is actually a map, can't we just iterate over the map and send a request for each chunk, with no assumptions about the chunk sizes?
Right. Fixed.
src/osd/PrimaryLogPG.cc
Outdated
chunk_name = chunk_name + "_" + to_string(chunk_index);
manifest->chunk_map[chunk_index] = manifest->chunk_map[0];
manifest->chunk_map[chunk_index].oid.oid = object_t(chunk_name);
manifest->chunk_map[chunk_index].flags = chunk_info_t::CHUNK_CLEAN;
This is baking in a specific chunking strategy for arbitrary writes. I'm not sure we should do that yet until we have some idea how to describe that policy?
I wonder if, instead, we want some flag to indicate that this particular chunk is local and 'dirty'. Which makes me question what the CLEAN and DIRTY flags really mean; if a particular chunk is DIRTY, that means the local copy is newer, in which case does it even make sense to say that there is a remote copy that is stale? Maybe for deferred cleanup of the ref or something? If not, we really have two types of chunks, more like LOCAL and REMOTE?
Either way, for a minimal next step, maybe we simply implement the chunked read and ignore the chunked write case? A simplistic approach would be for a write to a chunked object to trigger a promote and then a local write.
src/osd/PrimaryLogPG.cc
Outdated
case CEPH_OSD_OP_CHECKSUM:
  op.flags = (op.flags | CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL) &
    ~(CEPH_OSD_OP_FLAG_FADVISE_DONTNEED | CEPH_OSD_OP_FLAG_FADVISE_NOCACHE);
}
It seems like we also want some sort of can_proxy_chunked_read() helper that looks to see if the vector is composed solely of operations that can be translated across chunks (attr ops and extent ops, but not cls or omap stuff). If we !can_proxy_chunked_read() then we need to trigger a promote.
Once we consider the promote case, we need to figure out how to trigger the ref-counting operations that happen when we clear a manifest. This will require a bit of thinking. Maybe pool properties indicating that references to objects in that pool need to be sent a refcount.release op (or whatever)? That would also apply to a simple redirect manifest if the target pool has that property, I would imagine?
My suggestion for now is to
and then let's sort out
Force-pushed from 956ee20 to 13a5504
@liewegas I updated the source code as you suggested. If you agree with the code, I will sort out the chunk types, ref cleanup, and handling of a chunked object as you mentioned.
@liewegas @myoungwon yes, making the MDS use its GID when talking to the OSDs is 👍. Because the cephfs clients still expect to talk to the mds by rank, it means splitting the messengers out. I have a branch from a while back: https://github.com/jcsp/ceph/tree/wip-15399-twomsg -- I can't remember what state it's in or if it even compiles. The other reason we need this is to enable two MDS daemons from different filesystems (which may have the same rank) to use the same pool, so it's a real bonus if we can make this change.
src/osd/osd_types.cc
Outdated
redirect_target.dump(f);
f->close_section();
}
{
let's use a more strongly typed structure, something like

if (type == TYPE_CHUNKED) {
  f->open_array_section("chunk_map");
  for (auto& p : chunk_map) {
    f->open_object_section("chunk");
    f->dump_unsigned("offset", p.first);
    p.second.dump(f);
    f->close_section();
  }
  f->close_section();
}
src/osd/osd_types.h
Outdated
struct chunk_info_t {
  enum {
    CHUNK_CLEAN = 0,
    CHUNK_DIRTY = 1,
if these are flags, we just need FLAG_DIRTY = 1, and the lack of that flag means clean
src/osd/osd_types.h
Outdated
};
uint64_t length;
hobject_t oid;
uint64_t flags; // dirty, etc.
// FLAG_*
src/osd/osd_types.h
Outdated
case CHUNK_CLEAN: return "clean";
case CHUNK_DIRTY: return "dirty";
default: return "unknown";
}
string r;
if (flags & FLAG_DIRTY) {
  if (!r.empty())
    r += "|";
  r += "dirty";
}
return r;
src/osd/osd_types.cc
Outdated
@@ -4974,7 +5028,13 @@ void object_manifest_t::generate_test_instances(list<object_manifest_t*>& o)

ostream& operator<<(ostream& out, const object_manifest_t& om)
{
return out << "type:" << om.type << " redirect_target:" << om.redirect_target;
out << "type:" << om.type << " redirect_target:" << om.redirect_target;
out << "manifest(" << om.get_type_name(); if (om.is_redirect()) { out << " " << om.redirect_target; } else if (om.is_chunked()) { out << " " << om.chunk_map; } out << ")"; }
src/osd/osd_types.cc
Outdated
ostream& operator<<(ostream& out, const chunk_info_t& ci)
{
  return out << "length: " << ci.length << " oid: " << ci.oid
             << " flags: " << ci.get_flag_string(ci.flags);
return out << "(len: " << .... << ")";
Let's proceed!
Force-pushed from 13a5504 to ba776e1
@liewegas If a background thread is the right way to flush dirty objects, I think the existing tier agent can be reused, because the tier agent already does similar jobs. Do we need a separate thread?
I would re-use the agent thread. But the current agent does everything inline in a hard-coded way. We'd prefer to move to a model where we just submit rados ops; that will simplify things like qos and throttling as we move forward.
@liewegas If so, the simplest way is a "flush op" for manifests (a new rados op). What do you think?
@liewegas Can you give me some feedback? (About how to handle dirty chunks and ref cleanup)
src/osd/PrimaryLogPG.cc
Outdated
if (can_proxy_chunked_read(op)) {
  do_proxy_chunked_op(op, obc->obs.oi.soid, obc, write_ordered);
  return cache_result_t::HANDLED_PROXY;
}
I think the can_proxy block should go at the top... and if we can't proxy, we must promote. Right?
can_proxy_chunked_read() just checks whether ops (such as CEPH_OSD_OP_READ...) can be handled or not. If you look at the code above, you can see if (oi.size == 0) { promote_object() }.
oi.size == 0 indicates that the object needs to be promoted (== can't proxy).
The order of these should at least be reversed, right?
If we can proxy the read, do that.
Otherwise, if we don't have the object locally, promote.
@liewegas Yes, this is reversed. In the previous code, proxy_chunked_read() needed oi.size in order to pre-allocate a bufferlist for the read, so promote_object() was needed first. I fixed this as you commented (the object size will now be set from the chunk_map in the manifest).
src/osd/PrimaryLogPG.cc
Outdated
@@ -2344,6 +2344,17 @@ PrimaryLogPG::cache_result_t PrimaryLogPG::maybe_handle_manifest_detail(
}
return cache_result_t::HANDLED_PROXY;
case object_manifest_t::TYPE_CHUNKED:
  if (obc->obs.oi.size == 0) {
I think this should be if size > 0?
size == 0 means that this object is flushed (only a remote copy exists). So we need to promote.
I see
I would expect that normally, after we flush, we would zero or truncate off the local data. But on promote, we might recopy the data locally and not want to delete/dereference the chunk (yet) unless/until we overwrite it or something. Which means I think there are more states:
I'm not sure the HAS_FINGERPRINT state is necessary. Is it worth writing down the fingerprint before we actually write out the chunk? It seems like that can be in-memory state until the remote object is written/referenced, and then we record it. After a crash we might recalculate the hash, but that is a small price to pay for less complexity (and less IO to write down the intermediate state). So for dedup, the sequence would be:
The third step could also skip writing down the manifest and just wait for all chunks to flush. It makes a broader window for a crash that leaves a dangling ref, but generates less metadata IO.
Force-pushed from 188003d to 93be6f7
Signed-off-by: Myoungwon Oh <omwmw@sk.com>
@liewegas ping?
@myoungwon sorry for the slow response! Just spoke with @jdurgin about this. The concern has been that adding more tiering code to the OSD may make the performance refactoring more difficult. We think we should go ahead with this work, though, with the understanding that it is new/experimental and if it causes problems down the line we may have to disable it until it can be fixed. Hopefully that won't happen, but it would not be surprising to have to rework much of the proxied read/write code. Running this through the qa suite one more time before merge!
I agree with you. An experimental feature needs to be disabled until it is stable.
Bunch of failures on http://pulpito.ceph.com/sage-2017-11-17_20:41:49-rados-wip-sage-testing-2017-11-16-1438-distro-basic-smithi/; removing this PR to see if that is the problem.
This commit prevents promote_object() if the object needs to be blocked. Signed-off-by: Myoungwon Oh <omwmw@sk.com>
The failures occur due to the following reason:
Therefore, I pushed a commit that prevents promote_object() when the object is degraded.
http://pulpito.ceph.com/myoungwon-2017-11-20_11:25:57-rados:thrash-wip-chunked-manifest-distro-basic-smithi/ http://pulpito.ceph.com/myoungwon-2017-11-20_11:25:57-rados:thrash-wip-chunked-manifest-distro-basic-smithi/1869282/ These two errors seem to come from a bug that existed before (http://tracker.ceph.com/issues/21823#change-102372).
Force-pushed from 42d0955 to 09c0348
A reference leak occurs if sub_cop holds an ObjectContextRef. Signed-off-by: Myoungwon Oh <omwmw@sk.com>
retest this please
@liewegas I re-fixed the above two error cases. Ready to retest (http://pulpito.ceph.com/myoungwon-2017-11-23_02:21:50-rados:thrash-wip-chunked-manifest-distro-basic-smithi/). The errors were caused by the reference leak. Sorry for my previous misunderstanding.
@liewegas I think the test results look good. Can you take a look?
As discussed with @liewegas, these commits are the second stage (chunked manifest) for deduplication
(http://pad.ceph.com/p/deduplication_how_dedup_manifists,
http://pad.ceph.com/p/deduplication_how_do_we_store_chunk)
Signed-off-by: Myoungwon Oh omwmw@sk.com