
osd,librados: add manifest, operations for chunked object #15482

Merged
merged 9 commits on Nov 30, 2017

Conversation

myoungwon
Member

As discussed with @liewegas, these commits are the second stage (chunked manifest) of deduplication
(http://pad.ceph.com/p/deduplication_how_dedup_manifists,
http://pad.ceph.com/p/deduplication_how_do_we_store_chunk)

Signed-off-by: Myoungwon Oh omwmw@sk.com

@myoungwon
Member Author

@liewegas Could you review these commits?
They pass the following test:

rados -p rbd set-chunk chunk_test(object name) --target-pool sds-hot 131072
rados -p rbd put chunk_test ./test1
rados -p rbd get chunk_test ./test2
diff test1 test2

But some operations, such as copy_from, are difficult to proxy (read/write) for a chunked object. Therefore, I think the writeback (dedup) process is needed to support such operations: the base pool absorbs all writes, and this process converts the data to chunked data. (http://pad.ceph.com/p/deduplication_how_to_drive_dedup_process)

What do you think? Do we need to support all of the operations at this stage?

oi.manifest.chunk_map[cursor] = chunk_info;
}

oi.set_flag(object_info_t::FLAG_MANIFEST);
Member

I wonder if we should have something like

if (!oi.manifest.is_chunked()) {
  oi.manifest.clear();
}

so that the redirect target is cleared if it was a redirect (or any other future stuff that goes into the manifest is cleaned up).

Member Author

Fixed.

<< " req_length: " << req_length << dendl;

osd_reqid_t reqid = m->get_reqid();
reqid.inc+=req_offset;
Member

I don't think this will work. inc == incarnation, and it's there because the mds entity_name_t is something like mds.0, mds.1, and it's re-used after failover (with inc changing). We could change that so that the requests come from mds.$gid (which is unique) instead of mds.$rank, and then repurpose the inc field for this... that is probably my vote, actually. @jcsp does that work for you?

Member Author

@myoungwon myoungwon Jun 6, 2017

@liewegas Right. Anyway, we need to create a unique osd_reqid_t in order to avoid the dup-op check in do_op().
I will investigate this.

Member

I suggest using inc = original reqid.inc + chunk_index + 1 and we'll resolve the mds use of inc separately
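A minimal sketch of that derivation (the helper name is hypothetical; in the PR this would happen inline where the per-chunk sub-ops are built):

// Hypothetical helper: derive a unique reqid for each chunk sub-op so that
// the dup-op check in do_op() does not collapse the sub-ops into one request.
osd_reqid_t make_chunk_reqid(const osd_reqid_t& orig, uint64_t chunk_index)
{
  osd_reqid_t r = orig;
  r.inc = orig.inc + chunk_index + 1;  // original inc + chunk_index + 1
  return r;
}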

Member Author

Thanks for your suggestion. Fixed.

SnapContext snapc(m->get_snap_seq(), m->get_snaps());
object_manifest_t *manifest = &obc->obs.oi.manifest;
uint64_t chunk_length = manifest->chunk_map[0].length;
uint64_t chunk_index = req_offset - (req_offset % chunk_length);
Member

This is baking in the assumption that chunks are fixed-size. Since chunk_map is actually a map, can't we just iterate over the map and send a request for each chunk, with no assumptions about the chunk sizes?
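A minimal sketch of that iteration (send_chunk_read is a hypothetical helper standing in for the proxied sub-op machinery):

// Walk the (offset -> chunk_info_t) map; chunk sizes may vary per entry.
for (auto& p : manifest->chunk_map) {
  uint64_t chunk_off = p.first;
  uint64_t chunk_len = p.second.length;
  // skip chunks that do not overlap [req_offset, req_offset + req_length)
  if (chunk_off + chunk_len <= req_offset ||
      chunk_off >= req_offset + req_length)
    continue;
  send_chunk_read(p.second.oid, chunk_off, chunk_len);  // hypothetical helper
}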

Member Author

Right. Fixed.

chunk_name = chunk_name + "_" + to_string(chunk_index);
manifest->chunk_map[chunk_index] = manifest->chunk_map[0];
manifest->chunk_map[chunk_index].oid.oid = object_t(chunk_name);
manifest->chunk_map[chunk_index].flags = chunk_info_t::CHUNK_CLEAN;
Member

This is baking in a specific chunking strategy for arbitrary writes. I'm not sure we should do that yet until we have some idea how to describe that policy?

I wonder if, instead, we want some flag to indicate that this particular chunk is local and 'dirty'. Which makes me question what the CLEAN and DIRTY flags really mean: if a particular chunk is DIRTY, that means the local copy is newer; in which case, does it even make sense to say that there is a remote copy that is stale? (Maybe for deferred cleanup of the ref or something?) If not, we really have two types of chunks, more like LOCAL and REMOTE?

Member

Either way, for a minimal next step, maybe we simply implement the chunked read and ignore the chunked write case? A simplistic approach would be for a write to a chunked object to trigger a promote and then a local write.
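A minimal sketch of that fallback, assuming the surrounding maybe_handle_manifest_detail() context (the promote_object() arguments are illustrative):

// Writes to a chunked object: promote first, then let the write run locally.
if (write_ordered) {
  promote_object(obc, obc->obs.oi.soid, oloc, op, NULL);
  return cache_result_t::BLOCKED_PROMOTE;  // op is requeued after the promote
}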

case CEPH_OSD_OP_CHECKSUM:
op.flags = (op.flags | CEPH_OSD_OP_FLAG_FADVISE_SEQUENTIAL) &
~(CEPH_OSD_OP_FLAG_FADVISE_DONTNEED | CEPH_OSD_OP_FLAG_FADVISE_NOCACHE);
}
Member

It seems like we also want some sort of can_proxy_chunked_read() helper that looks to see if the vector is composed solely of operations that can be translated across chunks (attrs ops and extent ops, but not cls or omap stuff). If we !can_proxy_chunked_read() then we need to trigger a promote.
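A minimal sketch of such a helper, assuming it walks the op vector and whitelists only ops that translate across chunks (the exact op set shown is illustrative):

bool can_proxy_chunked_read(OpRequestRef op)
{
  const MOSDOp *m = static_cast<const MOSDOp*>(op->get_req());
  for (const auto& osd_op : m->ops) {
    switch (osd_op.op.op) {
    case CEPH_OSD_OP_READ:         // extent ops translate across chunks
    case CEPH_OSD_OP_SPARSE_READ:
    case CEPH_OSD_OP_STAT:
    case CEPH_OSD_OP_GETXATTR:     // attr ops are chunk-agnostic
      break;
    default:
      return false;                // cls, omap, etc.: must promote instead
    }
  }
  return true;
}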

Member

Once we consider the promote case, we need to figure out how to trigger the ref-counting operations that happen when we clear a manifest. This will require a bit of thinking. Maybe pool properties indicating that references to objects in that pool need to be sent a refcount.release op (or whatever)? That would also apply to a simple redirect manifest if the target pool has that property, I would imagine?

@liewegas
Member

liewegas commented Jun 5, 2017

My suggestion for now is to

  1. update the manifest types (as you do)
  2. focus just on proxy read for chunked ops, with helper
  3. put in the fallback to a promote

and then let's sort out

  1. how to define the chunk types (more clearly CLEAN, DIRTY, or consider alternatives)
  2. how to handle ref cleanup (e.g., when we promote or when we delete the object with the manifest)
  3. if/how to handle a chunked object where some chunks are clean and others are stored on the local object.

@myoungwon myoungwon force-pushed the wip-chunked-manifest branch 3 times, most recently from 956ee20 to 13a5504 on June 6, 2017 at 16:29
@myoungwon
Member Author

@liewegas I updated the source code as you suggested. If you agree with the code, I will sort out the chunk types, ref cleanup, and handling of chunked objects as you mentioned.

@jcsp
Contributor

jcsp commented Jun 6, 2017

@liewegas @myoungwon yes, making the MDS use its GID when talking to the OSDs is 👍. Because the cephfs clients still expect to talk to the mds by rank, it means splitting the messengers out.

I have a branch from a while back: https://github.com/jcsp/ceph/tree/wip-15399-twomsg -- I can't remember what state it's in or if it even compiles.

The other reason we need this is to enable two MDS daemons from different filesystems (which may have the same rank) to use the same pool, so it's a real bonus if we can make this change.

redirect_target.dump(f);
f->close_section();
}
{
Member

let's do something more strongly typed, like

if (type == TYPE_CHUNKED) {
  f->open_array_section("chunk_map");
  for (auto& p : chunk_map) {
    f->open_object_section("chunk");
    f->dump_unsigned("offset", p.first);
    p.second.dump(f);
    f->close_section();
  }
  f->close_section();
}

struct chunk_info_t {
enum {
CHUNK_CLEAN = 0,
CHUNK_DIRTY = 1,
Member

if these are flags, we just need FLAG_DIRTY = 1, and the lack of that flag means clean

};
uint64_t length;
hobject_t oid;
uint64_t flags; // dirty, etc.
Member

// FLAG_*

case CHUNK_CLEAN: return "clean";
case CHUNK_DIRTY: return "dirty";
default: return "unknown";
}
Member

string r;
if (flags & FLAG_DIRTY) {
  if (!r.empty())
    r += "|";
  r += "dirty";
}

@@ -4974,7 +5028,13 @@ void object_manifest_t::generate_test_instances(list<object_manifest_t*>& o)

ostream& operator<<(ostream& out, const object_manifest_t& om)
{
- return out << "type:" << om.type << " redirect_target:" << om.redirect_target;
+ out << "type:" << om.type << " redirect_target:" << om.redirect_target;
Member

out << "manifest(" << om.get_type_name();
if (om.is_redirect()) {
  out << " " << om.redirect_target;
}
else if (om.is_chunked()) {
  out << " " << om.chunk_map;
}
out << ")";
}

ostream& operator<<(ostream& out, const chunk_info_t& ci)
{
return out << "length: " << ci.length << " oid: " << ci.oid
<< " flags: " << ci.get_flag_string(ci.flags);
Member

return out << "(len: " << .... << ")";

@liewegas
Member

liewegas commented Jun 6, 2017 via email

@myoungwon
Member Author

myoungwon commented Jun 8, 2017

@liewegas

  1. how to define the chunk types (more clearly CLEAN, DIRTY, or consider alternatives)
  • I think we need to define at least three types:
    a. CLEAN: the local data (base pool) and the remote data (cas pool) are equal
    b. DIRTY: the local data differs from the remote data (the local data has been modified but not yet flushed)
    c. HAS_FINGERPRINT: the fingerprint has been generated

  • For example (deduplication):
    a. CLEAN --> (if the write is a full write, we can generate the fingerprint) DIRTY & HAS_FINGERPRINT --> flush --> CLEAN
    b. CLEAN --> (if the write is a partial write) DIRTY --> (read the full data, then generate the fingerprint) DIRTY & HAS_FINGERPRINT --> flush --> CLEAN

  2. how to handle ref cleanup (e.g., when we promote or when we delete the object with the manifest)
  • Your suggestion (if the pool property is set, a release op needs to be sent) makes sense.
    Do we need to define a new op (CEPH_OSD_OP_REFERENCE) and keep a reference count in an xattr or in object_info_t, as in the examples below? (See also the sketch after this list.)
    SetRedirect or SetChunk --> send the new op to the remote object (inc refcount)
    Unset* --> send the new op to the remote object (dec refcount; if refcount == 0, the data will be deleted)
    Or should we just send a delete to the target object when we clear a manifest?
  3. if/how to handle a chunked object where some chunks are clean and others are stored on the local object
  • My suggestion is a background thread. We will eventually need a simple offline agent for deduplication; this process can find dirty chunks and flush them to the remote pool.
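A minimal sketch of the refcount idea from point 2 (CEPH_OSD_OP_REFERENCE, the refcount xattr, and all helpers here are hypothetical):

// Hypothetical handler for a CEPH_OSD_OP_REFERENCE op on the target object:
// inc=true for SetRedirect/SetChunk, inc=false for Unset*.
void handle_reference_op(ObjectContextRef obc, bool inc)
{
  uint64_t cnt = get_refcount_xattr(obc);  // hypothetical: read "refcount" xattr
  cnt = inc ? cnt + 1 : cnt - 1;
  if (!inc && cnt == 0)
    delete_target_object(obc);             // hypothetical: last reference dropped
  else
    set_refcount_xattr(obc, cnt);          // hypothetical: persist the new count
}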

@myoungwon
Member Author

myoungwon commented Jun 9, 2017

@liewegas If a background thread is the right way to flush dirty objects, I think the existing tier agent can be reused, because it already does similar jobs. Do we need a separate thread?

@liewegas
Member

liewegas commented Jun 9, 2017 via email

@myoungwon
Member Author

@liewegas If so, the simplest way is a "flush" op for manifests (a new rados op). What do you think?

@myoungwon
Member Author

@liewegas Can you give me some feedback? (about how to handle dirty chunks and ref cleanup)

if (can_proxy_chunked_read(op)) {
do_proxy_chunked_op(op, obc->obs.oi.soid, obc, write_ordered);
return cache_result_t::HANDLED_PROXY;
}
Member

I think the can_proxy block should go to the top... and if we can't proxy, we must promote. Right?

Member Author

can_proxy_chunked_read() just checks whether ops (such as CEPH_OSD_OP_READ...) can be handled or not. If you look at the code above, you can see if (oi.size == 0) { promote_object() }.
oi.size == 0 indicates that the object needs to be promoted (== can't proxy).

Member

The order of these should at least be reversed, right?

If we can proxy the read, do that. Otherwise, if we don't have the object locally, promote.

Member Author

@liewegas Yes, this is reversed. In the previous code, proxy_chunked_read() needed oi.size to pre-allocate a bufferlist for the read, so promote_object() had to run first. I fixed this as you suggested (the object size is now set from the chunk_map in the manifest).
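A minimal sketch of the resulting order in maybe_handle_manifest_detail() (the calls follow the PR snippets above; the promote_object() arguments are illustrative):

case object_manifest_t::TYPE_CHUNKED:
  // try to proxy first; only fall back to a promote if we cannot
  if (can_proxy_chunked_read(op)) {
    do_proxy_chunked_op(op, obc->obs.oi.soid, obc, write_ordered);
    return cache_result_t::HANDLED_PROXY;
  }
  // can't proxy: promote the object and block the op until that completes
  promote_object(obc, obc->obs.oi.soid, oloc, op, NULL);
  return cache_result_t::BLOCKED_PROMOTE;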

@@ -2344,6 +2344,17 @@ PrimaryLogPG::cache_result_t PrimaryLogPG::maybe_handle_manifest_detail(
}
return cache_result_t::HANDLED_PROXY;
case object_manifest_t::TYPE_CHUNKED:
if (obc->obs.oi.size == 0) {
Member

I think this should be if size > 0?

Member Author

size == 0 means that this object has been flushed (only the remote copy exists), so we need to promote.

Member

I see

@liewegas
Member

a. CLEAN: the local data (base pool) and the remote data (cas pool) are equal

I would expect that normally, after we flush, we would zero or truncate off the local data. But on promote, we might copy the data back locally and not want to delete/dereference the chunk (yet) unless/until we overwrite it or something. Which means I think there are more states:

  • CLEAN: local copy and remote copy match
  • DIRTY: local copy is up to date, remote copy is stale (or does not exist, depending on whether the remote oid is hobject_t() or an actual object)
  • MISSING: no local copy, only remote copy

I'm not sure the HAS_FINGERPRINT state is necessary. Is it worth writing down the fingerprint before we actually write out the chunk? It seems like that can be in-memory state until the remote object is written/referenced, and then we record it. After a crash we might recalculate the hash, but that is a small price to pay for less complexity (and less IO to write down the intermediate state).
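A minimal sketch of those states as chunk_info_t flags (values illustrative; the final enum may differ):

struct chunk_info_t {
  enum {
    FLAG_DIRTY   = 1,  // local copy up to date; remote copy stale or absent
    FLAG_MISSING = 2,  // no local copy; only the remote copy exists
    // neither flag set == CLEAN: local and remote copies match
  };
  uint64_t length = 0;
  hobject_t oid;
  uint64_t flags = 0;  // FLAG_*
};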

So for dedup, the sequence would be

  • no manifest (object is local)
  • we decide there are 2 chunks
  • we write the first chunk, new manifest written with a CLEAN and DIRTY chunk (or the CLEAN chunk is MISSING and we zero out that range of the object)
  • we write the second chunk, local object truncated to 0, new manifest written with two MISSING chunks

The third step could also skip writing down the manifest and just wait for all chunks to flush. It makes a broader window for a crash that leaves a dangling ref, but generates less metadata IO.

Signed-off-by: Myoungwon Oh <omwmw@sk.com>
Signed-off-by: Myoungwon Oh <omwmw@sk.com>
Signed-off-by: Myoungwon Oh <omwmw@sk.com>
@myoungwon
Member Author

@liewegas ping?

@liewegas
Member

@myoungwon sorry for the slow response! Just spoke with @jdurgin about this. The concern has been that adding more tiering code to the OSD may make the performance refactoring more difficult. We think we should go ahead with this work, though, with the understanding that it is new/experimental, and if it causes problems down the line we may have to disable it until it can be fixed. Hopefully that won't happen, but it would not be surprising if much of the proxied read/write code has to be reworked.

Running this through the qa suite one more time before merge!

@myoungwon
Member Author

I agree with you. An experimental feature needs to be disabled until it is stable.

@liewegas
Member

Bunch of failures on http://pulpito.ceph.com/sage-2017-11-17_20:41:49-rados-wip-sage-testing-2017-11-16-1438-distro-basic-smithi/; removing this PR to see if that is the problem

This commit prevents promote_object() if the object needs to be blocked

Signed-off-by: Myoungwon Oh <omwmw@sk.com>
@myoungwon
Member Author

The failures occurred for the following reason:

  1. An object is degraded (has missing)
  2. The OSD receives a read op for a chunked object
  3. The chunked object has a missing chunk, so promote_object is invoked to process the read op
  4. promote_object causes write operations
    (it seems that this causes an unwanted missing state for the object)

Therefore, I pushed a commit that prevents promote_object when the object is degraded.
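A minimal sketch of that guard, assuming PrimaryLogPG's existing degraded-object helpers (the exact placement and the result value are illustrative):

// Before promoting, requeue the op if the object is degraded; promoting
// would issue writes against a degraded object.
if (is_degraded_or_backfilling_object(obc->obs.oi.soid)) {
  wait_for_degraded_object(obc->obs.oi.soid, op);
  return cache_result_t::BLOCKED_RECOVERY;  // hypothetical result value
}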

@myoungwon myoungwon changed the title WIP osd,librados: add manifest, operations for chunked object osd,librados: add manifest, operations for chunked object Nov 21, 2017
reference leak occurs if sub_cop has ObjectContextRef

Signed-off-by: Myoungwon Oh <omwmw@sk.com>
@myoungwon
Member Author

retest this please

@myoungwon
Member Author

@liewegas I fixed the two error cases above. Ready to retest (http://pulpito.ceph.com/myoungwon-2017-11-23_02:21:50-rados:thrash-wip-chunked-manifest-distro-basic-smithi/). The error was caused by the reference leak. Sorry for my earlier misunderstanding.

@myoungwon
Member Author

@liewegas I think test results look good. Can you take a look?
