librbd: generalized deep copy function #16238

Merged
merged 7 commits into from Nov 6, 2017

@trociny trociny commented Jul 10, 2017

(mostly taken from rbd-mirror image sync)

Signed-off-by: Mykola Golub mgolub@mirantis.com

Contributor Author

trociny commented Jul 10, 2017

@dillaman I created this PR at this stage just to make discussion easier.

The current version still needs cleanup and unit tests, and works only if source and destination image object sizes are the same. Now I am trying to generalize it to work when src and dst object sizes differ.

The approach I am trying is to replace the one-to-one src-to-dst object mapping with a src object extent to dst object extent mapping, and to turn ObjectCopyRequest into what is effectively an ObjectExtentCopyRequest, which is called to copy src object N extent [a, b] to dst object M extent [c, d].

Is this the approach you were thinking about?

It looks like the most complicated part here is modifying ObjectCopyRequest::compute_diffs. I think the TRUNC and REMOVE operations that made sense for object-to-object mapping should now be replaced by a DISCARD (ZERO) op. But then that might lead to empty or untruncated objects at the destination?

Also, I am not sure what to do with m_snap_object_states and update_object_map...

Contributor Author

trociny commented Jul 11, 2017

@dillaman I see now that my "copy extent by extent approach" is actually wrong (due to the problems I described above). It looks like what I need is:

for every dst object

  • map it to the list of src objects (extents)
  • copy snapshots from these src objects to dst object (and build snap_map)
  • copy object extents for every snapshot
  • update dst object state in object map

(actually, now this looks to me like what you initially suggested)
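The first step of that list (mapping a destination object to source object extents) can be sketched as follows. This is a minimal illustration, assuming default striping on both images so that object N covers image bytes [N * object_size, (N + 1) * object_size). The `SrcExtent` type and `map_dst_to_src` function are hypothetical names for this sketch and are not part of librbd.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One contiguous piece of a source object (hypothetical type for this sketch).
struct SrcExtent {
  uint64_t object_no;  // source object number
  uint64_t offset;     // offset inside the source object
  uint64_t length;     // number of bytes
};

// Map destination object `dst_object_no` to the source extents backing it.
// Assumes default (non-fancy) striping on both the source and destination.
std::vector<SrcExtent> map_dst_to_src(uint64_t dst_object_no,
                                      uint64_t dst_object_size,
                                      uint64_t src_object_size,
                                      uint64_t image_size) {
  assert(dst_object_size > 0 && src_object_size > 0);
  std::vector<SrcExtent> extents;
  // Image range covered by the destination object, clipped to the image size.
  uint64_t image_off = dst_object_no * dst_object_size;
  uint64_t image_end = std::min(image_off + dst_object_size, image_size);
  while (image_off < image_end) {
    uint64_t src_no = image_off / src_object_size;
    uint64_t src_off = image_off % src_object_size;
    // Stop at the end of the source object or of the destination range.
    uint64_t len = std::min(src_object_size - src_off, image_end - image_off);
    extents.push_back(SrcExtent{src_no, src_off, len});
    image_off += len;
  }
  return extents;
}
```

With a destination object size of 8 and a source object size of 4, destination object 1 maps to source objects 2 and 3 in full. With the sizes reversed, each destination object maps to a slice of a single source object.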

return;
}

// rollback the object map (copy snapshot object map to HEAD)

@trociny How about starting the comment with a capital letter?
// Rollback the object map (copy snapshot object map to HEAD)


ldout(m_cct, 20) << dendl;

// Change the image size on disk so that the snapshot picks up

@trociny I believe a multi-line comment should be inside /* */

}

{
// adjust in-memory image size now that it's updated on disk

@trociny I believe it would be even better to start the comment with a capital letter, something like:
// Adjust in-memory image size now that it's updated on disk

ldout(m_cct, 20) << "object_map_oid=" << object_map_oid << ", "
<< "object_count=" << object_count << dendl;

// initialize an empty object map of the correct size (object sync

@trociny multiline comment inside /* */

Contributor Author

trociny commented Aug 28, 2017

@dillaman Here is an updated copy_deep implementation, which supports different source and destination image object sizes. It still needs unit tests though.

Could you please look at the current version, so I can be sure I am going in the right direction?

Also, I would like to discuss the scope of this PR. I have added a copy_deep internal function, which is currently used only in tests. I plan to use this function in rbd-mirror instead of image_sync. Also, I think image_copy::ObjectCopyRequest should be integrated (somehow) into io::ObjectRequest so it can be called for live-migrating images. Is this what you were thinking initially? Would just the copy_deep internal function implementation (as it currently is) be enough for this PR?

Contributor Author

trociny commented Aug 29, 2017

@dillaman I have (temporarily) added a commit that makes rbd-mirror use the generalized copy deep. I think it can be pushed as a separate PR later. It is here just to show how I plan to use copy deep.

@@ -34,13 +34,18 @@ set(librbd_internal_srcs
exclusive_lock/StandardPolicy.cc
image/CloneRequest.cc
image/CloseRequest.cc
image/CopyRequest.cc

Nit: I would probably move this up to the root and rename to DeepCopyRequest

image/CreateRequest.cc
image/OpenRequest.cc
image/RefreshParentRequest.cc
image/RefreshRequest.cc
image/RemoveRequest.cc
image/SetFlagsRequest.cc
image/SetSnapRequest.cc
image_copy/ImageCopyRequest.cc

Nit: subdirectory should be deep_copy

}
}

// Recalculate for destination

Why would the destination size be different on a copy?

Contributor Author

Not the destination image size but total number of objects (m_end_object_no) may be different if src and dst object sizes are different.
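As a small worked example of that point: the same image size yields a different object count once the object size changes. The sketch below assumes the usual convention that an image's object size is `1 << order`; the helper name `object_count` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Number of objects needed to cover `image_size` bytes when each object
// holds `1 << order` bytes (rounds up for a partial last object).
uint64_t object_count(uint64_t image_size, uint8_t order) {
  assert(order < 64);
  uint64_t object_size = uint64_t{1} << order;
  return (image_size + object_size - 1) / object_size;
}
```

For a 10 MiB image, order 22 (4 MiB objects) gives 3 objects, while order 20 (1 MiB objects) gives 10.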

Good point, but the way that ObjectCopyRequest is currently designed, it assumes that the source and destination have matching layouts (order and striping v2). If we want to support overriding those settings on the destination, which I think is important, that class needs to be redesigned so that you loop over all destination objects and then the ObjectCopyRequest state machine loads diffs for all source objects that overlap w/ the destination object's layout, computes the necessary ops over the overlap, and does the single destination write per snapshot.

Given that, I think you would drop this logic and change the loop above to iterate over the destination image instead of the source image.

Contributor Author

I suppose it is exactly how it works now? I.e. here we calculate number of destination objects and then for every destination object call ObjectCopyRequest. In my modified implementation it expects destination object number as a param, and then it maps dst object to src objects, loads diffs, etc...

But I forgot about the striping v2 feature. I need to refresh my memory about how it works, but right now I think the code will not work correctly in the general case if striping is enabled.

I missed that you adjusted it to handle that case. All object and offset calculation should use the Striper logic instead of attempting to re-implement it all over the place. You would use the striper to calculate the image extents covered by the destination object and then for each image extent you would map that back to source object extents.

#include "ObjectCopyRequest.h"
#include "common/errno.h"
#include "librbd/Utils.h"
#include "tools/rbd_mirror/ProgressContext.h"

Nit: yank

#include "osdc/Striper.h"

#define dout_context g_ceph_context
#define dout_subsys ceph_subsys_rbd_mirror

Nit: adjust

}

template <typename I>
void ImageSync<I>::handle_refresh_object_map(int r) {
dout(20) << dendl;
void ImageSync<I>::handle_unset_sync_point_snap_contex(int r) {

Nit: typo

* |
* v
* COPY_IMAGE . . . . . . . . . . . . . .
* | .
* v .
* COPY_OBJECT_MAP (skip if object .
* | map disabled) .
* UNSET_SYNC_POINT_SNAP_CONTEX .

Nit: typo

@@ -79,18 +76,15 @@ class ImageSync : public BaseRequest {
* CREATE_SYNC_POINT (skip if already exists and
* | not disconnected)
* v
* COPY_SNAPSHOTS
* SET_SYNC_POINT_SNAP_CONTEX

Nit: typo

.WillOnce(Invoke([this, &mock_snapshot_copy_request, r]() {
m_threads->work_queue->queue(mock_snapshot_copy_request.on_finish, r);
}));
void expect_set_sync_point_snap_contex(

Nit: typo

.WillOnce(Invoke([this](Context *ctx) {
m_threads->work_queue->queue(ctx, 0);
}));
void expect_unset_sync_point_snap_contex(

Nit: typo

dillaman commented Aug 30, 2017

@trociny Looks really good to me -- just nits basically. Might be even clearer history-wise if you just git mv the support classes over to librbd along w/ the associated unit tests

}

// prepare the object map state
// XXXMG: how to support OBJECT_EXISTS_CLEAN?
Contributor Author

@dillaman What do you think about this? The commented code is from the original image_sync. Thinking about this more, it looks to me like I can safely use the same logic, i.e. just uncomment this code and remove m_snap_object_states[end_dst_snap_id] = OBJECT_EXISTS.

Seems OK to me

Contributor Author

trociny commented Sep 5, 2017

@dillaman updated according to your comments + unit tests added.

Though right now I think I unnecessarily limited the DeepCopyRequest interface. It expects src_image_ctx to be set to a snapshot, which is used as snap_id_end, and for snap_id_start we always use 0.

I am planning to generalize DeepCopyRequest to take copy_point_start and copy_point_end arguments.

@trociny trociny changed the title [DNM] librbd: generalized copy deep function librbd: generalized copy deep function Sep 11, 2017
Contributor Author

trociny commented Sep 11, 2017

@dillaman updated. The commit that makes rbd-mirror use the generalized copy deep is still here to show how deep copy is used. I can push it as a separate PR if it makes reviewing and merging easier.

@trociny trociny changed the title librbd: generalized copy deep function librbd: generalized deep copy function Sep 11, 2017
return -EINVAL;
}

if (m_src_image_ctx->snap_info.find(m_snap_id_end) ==

For the rbd CLI's deep-copy, it might be nice to support copying to the HEAD revision if desired (i.e. if the image is not being actively modified, no need to create a snapshot).

}

template <typename I>
void DeepCopyRequest<I>::send_remove_copy_snapshot() {

Since this state machine didn't create the snapshot, perhaps it shouldn't be responsible for deleting it -- especially if the goal is to have rbd-mirror use its own snapshot namespace eventually.

void DeepCopyRequest<I>::send_copy_object_map() {
m_dst_image_ctx->owner_lock.get_read();
m_dst_image_ctx->snap_lock.get_read();
if (!m_dst_image_ctx->test_features(RBD_FEATURE_OBJECT_MAP,

... also skip if copying from the HEAD revision since the object copy should have updated the HEAD object map

#include "librbd/deep_copy/SnapshotCopyRequest.h"
#include "librbd/operation/SnapshotRemoveRequest.h"

#define dout_context g_ceph_context

Nit: yank

#include "librbd/Utils.h"
#include "osdc/Striper.h"

#define dout_context g_ceph_context

Nit: yank

m_snap_object_sizes = {};
m_src_object_offsets = {};

std::vector<std::pair<uint64_t, uint64_t>> file_extents;

Nit: file -> image


template <typename I>
void ObjectCopyRequest<I>::send_list_snaps() {
// starting copy from highest src object number is important for truncate

This needs to be able to handle a truncate of any source object -- and w/ striping, it's possible that the "end" image extent of the destination object is not aligned with the highest source object number.

Contributor Author

If a destination object number x is mapped to src object numbers a, b, c (due to different object sizes), I assume that it will always be a < b < c for any striping. Now, if, when reading the src objects, I see that c does not exist and b is truncated, I can truncate the destination object to b's truncated length. If b is truncated but c exists, I know that I need to write a hole.

For this reason I start listing src objects from the highest one.

@dillaman dillaman Sep 14, 2017

If fancy striping is used, I am not sure that logic is correct:

src:

  • object size 2
  • stripe unit = 1
  • stripe count = 3

dst:

  • object size 4
  • no fancy striping

The mapping would be: object 1@0~1, object 2@0~1, object 3@0~1, object 1@1~1. If object 3 doesn't exist in this case, that's a hole and not a truncate.
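The mapping in this example can be reproduced with a short sketch of the striping arithmetic. This is an illustration written for this thread, not the Striper code itself; the `Layout` struct and `image_to_object` function are hypothetical names, and objects are numbered from 0 here, so "object 1" above corresponds to object 0 below.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Image layout parameters (hypothetical struct for this sketch).
struct Layout {
  uint64_t object_size;
  uint64_t stripe_unit;
  uint64_t stripe_count;
};

// Map an image byte offset to (object number, offset inside that object).
std::pair<uint64_t, uint64_t> image_to_object(const Layout &l, uint64_t off) {
  assert(l.stripe_unit > 0 && l.stripe_count > 0);
  assert(l.object_size % l.stripe_unit == 0);
  uint64_t stripes_per_object = l.object_size / l.stripe_unit;
  uint64_t block_no = off / l.stripe_unit;          // index of the stripe unit
  uint64_t stripe_no = block_no / l.stripe_count;   // row of stripe units
  uint64_t stripe_pos = block_no % l.stripe_count;  // column within the row
  uint64_t object_set_no = stripe_no / stripes_per_object;
  uint64_t object_no = object_set_no * l.stripe_count + stripe_pos;
  uint64_t object_off = (stripe_no % stripes_per_object) * l.stripe_unit +
                        off % l.stripe_unit;
  return {object_no, object_off};
}
```

With the source layout {object size 2, stripe unit 1, stripe count 3}, image bytes 0 to 3 (the range of the first destination object) land in source objects 0, 1, 2 and then 0 again at offset 1. A missing third source object therefore leaves a hole in the middle of the destination object rather than at its end.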

Contributor Author

@dillaman It is good to know. Thanks!

bufferlist out_bl;
};

typedef std::map<uint64_t, std::vector<std::pair<uint64_t, size_t>>> ObjectOffsets; // src_object_id -> list of (object_off, len)

Nit: ObjectExtents is the terminology used elsewhere

auto len = it.second;

interval_set<uint64_t> copy_interval;
copy_interval.insert(src_object_off, src_object_off + len);

The normal interval_set insert logic takes a start value and a length, not start and end values.
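To make the difference concrete, here is a toy stand-in with the same (start, length) calling convention. `ToyIntervalSet` is a hypothetical class written only for this illustration; it does no merging or overlap checking and is not Ceph's interval_set.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Toy container whose insert() takes (start, length), used only to show the
// effect of passing an end offset where a length is expected.
class ToyIntervalSet {
public:
  void insert(uint64_t start, uint64_t len) {
    assert(len > 0);
    m_intervals[start] = len;
    m_size += len;
  }
  // Total number of bytes covered by the inserted intervals.
  uint64_t size() const { return m_size; }
  // End offset of the last interval.
  uint64_t range_end() const {
    assert(!m_intervals.empty());
    auto last = m_intervals.rbegin();
    return last->first + last->second;
  }
private:
  std::map<uint64_t, uint64_t> m_intervals;  // start -> length
  uint64_t m_size = 0;
};
```

For an extent at offset 4096 with length 512, `insert(4096, 4096 + 512)` records 4608 bytes ending at 8704, whereas `insert(4096, 512)` records the intended 512 bytes ending at 4608.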

Thinking about the complexity of this some more, I wonder if it would make more sense to refactor the CopyOp computation logic and separate out the read and write ops into two different collections. The reads would be ordered by write/read snapshots and source object and the writes could be ordered by write snapshot.

We would change the flow a little so that for each source object, we first list its diff then perform the read. This allows us to avoid repeating all source object reads should a retry be required. Once all the reads have been performed across all source objects, we can perform the write stage.

When computing the diffs, instead of creating remove/zero/truncate ops, we could instead just track an interval set of zeroed locations for the destination object at the current write snapshot. Then as part of the write step, we would generate zero, truncate or remove ops I think pretty easily.
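One possible reading of that last step, sketched under stated assumptions: the zeroed intervals for a destination object are already merged and sorted, and `object_end` is the size the destination object would otherwise have at the current write snapshot. The types and the function `zero_intervals_to_ops` are hypothetical names for this sketch, not code from the PR.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Interval {
  uint64_t offset;
  uint64_t length;
};

enum class OpType { ZERO, TRUNCATE, REMOVE };

struct WriteOp {
  OpType type;
  uint64_t offset;
  uint64_t length;  // only meaningful for ZERO
};

// Turn the zeroed intervals of one destination object into write-side ops.
// Assumes `zeroed` is sorted by offset and contains no overlapping intervals.
std::vector<WriteOp> zero_intervals_to_ops(const std::vector<Interval> &zeroed,
                                           uint64_t object_end) {
  std::vector<WriteOp> ops;
  for (const auto &z : zeroed) {
    assert(z.length > 0);
    uint64_t end = z.offset + z.length;
    if (z.offset == 0 && end >= object_end) {
      // The whole object is zeroed: remove it.
      return std::vector<WriteOp>{WriteOp{OpType::REMOVE, 0, 0}};
    } else if (end >= object_end) {
      // The zeroed range reaches the end of the object: truncate.
      ops.push_back(WriteOp{OpType::TRUNCATE, z.offset, 0});
    } else {
      // The zeroed range is followed by data: write a hole.
      ops.push_back(WriteOp{OpType::ZERO, z.offset, z.length});
    }
  }
  return ops;
}
```

Under this scheme the hole-versus-truncate decision depends only on where the zeroed range sits within the destination object, not on which source object it came from.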

void send_update_object_map();
void handle_update_object_map(int r);

Context *start_dst_op(RWLock &owner_lock);

Nit: start_lock_op or start_destination_op

Contributor Author

trociny commented Oct 9, 2017

@dillaman updated. Still open questions:

  • Should the recently added MetadataCopyRequest be a part of deep_copy, or rather a separate request?
  • I don't know how to set OBJECT_EXISTS_CLEAN flag properly (probably because I am not sure I understand this feature).

@dillaman

@trociny The MetadataCopyRequest just needs to run once somewhere when doing a deep-copy. I added it to rbd::mirror::ImageSync as the last state since it would only need to execute once in theory (i.e. in case the image copy was canceled). The CLEAN flag should be set when the previous snapshot's object map was OBJECT_EXISTS or OBJECT_EXISTS_CLEAN and there are no diffs in the current snapshot (HEAD) revision.
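The CLEAN rule described above can be written as a small decision function. This is a sketch only: the `ObjectState` enum and `compute_object_state` are hypothetical names, and the handling of the cases the comment does not mention (a nonexistent object, or an object with diffs) is an assumption made for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-in for the object map states discussed in this thread.
enum class ObjectState { NONEXISTENT, EXISTS, PENDING, EXISTS_CLEAN };

// Decide the object map state for the current snapshot (or HEAD).
// CLEAN applies when the previous snapshot's state was EXISTS or EXISTS_CLEAN
// and the current revision has no diffs. The other branches are assumptions.
ObjectState compute_object_state(ObjectState prev_state, bool object_exists,
                                 bool has_diffs) {
  if (!object_exists) {
    return ObjectState::NONEXISTENT;
  }
  bool prev_existed = prev_state == ObjectState::EXISTS ||
                      prev_state == ObjectState::EXISTS_CLEAN;
  if (prev_existed && !has_diffs) {
    return ObjectState::EXISTS_CLEAN;
  }
  return ObjectState::EXISTS;
}
```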

Contributor Author

trociny commented Oct 11, 2017

@dillaman updated to deal with OBJECT_EXISTS_CLEAN.

As for MetadataCopyRequest, if I understand your idea correctly, I will leave it as is for now, so DeepCopyRequest will not call it. Later, when I need it for the migration project (so metadata would be copied on migration too), I can move it somewhere in librbd (where could that be, BTW?).

@dillaman

@trociny Sorry -- I was saying that I think the metadata copy should be moved to deep-copy as well. I was just trying to say that you could move it anywhere so long as it executes at the end of the deep-copy.

Contributor Author

trociny commented Oct 11, 2017

@dillaman Thanks. I will move MetadataCopyRequest. Also, my "OBJECT_EXISTS_CLEAN" code still needs some work.

Contributor Author

trociny commented Oct 12, 2017

@dillaman updated

Contributor Author

trociny commented Oct 30, 2017

@dillaman I have updated fsx to support deep-copy. I have not tested it well yet, and I still need to enable it in qa. Right now I have a question though. I had to link fsx with internal libraries. Might we want to add deep_copy to API instead?

@dillaman

@trociny Yeah -- add it to the API since I can see this being added to the rbd CLI as well in a future PR.

@ceph-jenkins
Collaborator

submodules for project are unmodified

@ceph-jenkins
Collaborator

all commits in this PR are signed

@ceph-jenkins
Collaborator

OK - docs built

Contributor Author

trociny commented Nov 1, 2017

@dillaman I have added deep_copy to the API and enabled the fsx qa tests. I am now observing fsx test failures on teuthology [1], which I still need to track down (I have failed to reproduce them locally so far; the failing fsx command completes successfully on my "vstart" env).

In the meantime, I have questions about the deep_copy API:

  • In the current version it is possible to set only the end snap (by setting the src image snap context). Maybe we want to allow setting the start snap too (start_snap_name, end_snap_name params)? It could be used for incremental backup.
  • As seen in the fsx case, copying end_snap (the "copy" snapshot) to the destination is not useful in all cases and will require additional work from the user if she wants "clone"-like behavior. Maybe we need to improve this?

[1] http://pulpito.ceph.com/trociny-2017-11-01_10:45:27-rbd-wip-mgolub-testing-distro-basic-smithi/

@ceph-jenkins
Collaborator

make check succeeded

CEPH_RBD_API int rbd_deep_copy(rbd_image_t src, rados_ioctx_t dest_io_ctx,
const char *destname, rbd_image_options_t dest_opts);
CEPH_RBD_API int rbd_deep_copy_with_progress(rbd_image_t image,
rados_ioctx_t dest_p,

Nit: dest_io_ctx?

for (auto &p : src_object_extents) {
for (auto &s : p.second) {
m_src_objects.insert(s.objectno);
m_src_object_extents.push_back({s.objectno, s.offset, s.length});
Contributor Author

@dillaman I think I have found an issue here.
I need m_src_object_extents list to be ordered by a destination object offset (so I could sequentially write extents to the destination object; alternatively I would need to track src_offset->dst_offset). But I did not realize Striper::file_to_extents returned a map ordered by object and the order is lost. So I need to sort extents returned in src_object_extents.

Right now I don't see a good way. It looks like using Striper::extent_to_file + Striper::file_to_extents like above I can't map dst_object_offset -> src_object, src_object_offset.

Contributor Author

Ah, it looks like I just need to use a different version of Striper::file_to_extents.

Contributor Author

No, the version that returns a list uses assimilate_extents, which does the same as my code above, so it returns the list in the wrong order.

It looks like I would need to extend Striper to provide the file_to_extents version I need?

Contributor Author

@dillaman I fixed this by introducing src_to_dst_object_offset() (see the fixup commit 48029c2). Do you think it could be a solution or do we need something more efficient (e.g. update Striper to return results we need)?

After this change fsx still fails, but in a different place. I need to track this further.

Contributor Author

@trociny trociny Nov 3, 2017

@dillaman I found another problem with compute_src_object_extents(): I need to split src object extents if the destination stripe_unit is smaller than the source stripe_unit. The fixup commit 9024db9 contains the fix for this issue too. Now it is much better: I have not been able to make fsx fail so far!

I will squash the fixup after your review.
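The splitting step mentioned above can be sketched as follows. This is an illustration, not the code from the fixup commit: `ObjectExtent` here is a local stand-in type and `split_extent` is a hypothetical helper, with `unit` assumed to be the smaller of the source and destination stripe units.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Local stand-in for an object extent (object number, offset, length).
struct ObjectExtent {
  uint64_t object_no;
  uint64_t offset;
  uint64_t length;
};

// Split one source object extent into consecutive pieces of at most `unit`
// bytes, so that each piece can be mapped to a destination offset separately.
std::vector<ObjectExtent> split_extent(const ObjectExtent &e, uint64_t unit) {
  assert(unit > 0);
  std::vector<ObjectExtent> pieces;
  uint64_t off = e.offset;
  uint64_t end = e.offset + e.length;
  while (off < end) {
    uint64_t len = std::min(unit, end - off);
    pieces.push_back(ObjectExtent{e.object_no, off, len});
    off += len;
  }
  return pieces;
}
```

Each resulting piece stays within one stripe unit of the smaller layout, so its destination offset can be looked up on its own and the pieces can then be ordered by that offset.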

@trociny trociny force-pushed the wip-copy-deep branch 2 times, most recently from f1ff5fe to 9024db9 Compare November 3, 2017 21:28
Contributor Author

trociny commented Nov 4, 2017

assert(s.length >= stripe_unit);
auto dst_object_offset = src_to_dst_object_offset(s.objectno, s.offset);
m_src_object_extents[dst_object_offset] = {s.objectno, s.offset,
stripe_unit};
@dillaman dillaman Nov 4, 2017

What would happen if the dst_object_offset is in the middle of a dest stripe? For example, if the source extent was 2~2 (stripe of 2) and the destination extent was 0~3 (stripe of 3), the first byte of the source extent would map to the destination extent 2~1 but the last byte of the min stripe would go somewhere else.

Contributor Author

@dillaman I am not sure I see exactly what you mean, but taking the condition that is used for stripe_unit on image creation:

(1 << order) % stripe_unit == 0 && stripe_unit <= (1 << order)

which I think can be rewritten as:

stripe_unit == (1 << x), where x <= order

which means that a situation like you described is not possible?
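The equivalence claimed here can be checked by brute force for small orders. In the sketch below, `valid_stripe_unit` restates the quoted condition with an added guard against a zero stripe unit (an addition made for this sketch to avoid a division by zero); all function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// The condition quoted above: the stripe unit divides the object size and
// does not exceed it. The zero check is an addition for this sketch.
bool valid_stripe_unit(uint64_t stripe_unit, uint8_t order) {
  uint64_t object_size = uint64_t{1} << order;
  return stripe_unit != 0 && stripe_unit <= object_size &&
         object_size % stripe_unit == 0;
}

// True when v is a power of two.
bool is_power_of_two(uint64_t v) { return v != 0 && (v & (v - 1)) == 0; }

// Check that, for the given order, the condition accepts exactly the powers
// of two that are no larger than the object size.
bool condition_matches_power_of_two(uint8_t order) {
  assert(order < 20);  // keep the exhaustive loop small
  uint64_t object_size = uint64_t{1} << order;
  for (uint64_t su = 1; su <= 2 * object_size; ++su) {
    bool expected = is_power_of_two(su) && su <= object_size;
    if (valid_stripe_unit(su, order) != expected) {
      return false;
    }
  }
  return true;
}
```

A stripe unit of 3, as in the example being discussed, is rejected for every order because 3 never divides a power of two.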

OK -- fair enough if odd stripe units are prevented. We have plenty of "fsx" testing time to catch any potential regressions. Go ahead and squash the fixup commit (or just prefix it w/ "librbd") for merging.

Contributor Author

@dillaman thanks. Rebased.

} // namespace librbd

// template definitions
template class librbd::DeepCopyRequest<librbd::MockTestImageCtx>;

Nit: no need to explicitly instantiate -- it's causing a compile warning:

In file included from /home/jdillaman/ceph/src/test/librbd/test_mock_DeepCopyRequest.cc:148:0:
/home/jdillaman/ceph/src/librbd/DeepCopyRequest.cc:64:6: warning: ‘void librbd::DeepCopyRequest<ImageCtxT>::cancel() [with ImageCtxT = librbd::{anonymous}::MockTestImageCtx]’ defined but not used [-Wunused-function]
void DeepCopyRequest<I>::cancel() {
^~~~~~~~~~~~~~~~~~

Contributor Author

@dillaman fixed. Thanks.

Mykola Golub and others added 7 commits November 6, 2017 10:26
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
(so it could be used for logging image options)

Signed-off-by: Mykola Golub <to.my.trociny@gmail.com>
(based on rbd-mirror image sync)

Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Signed-off-by: Mykola Golub <to.my.trociny@gmail.com>
Signed-off-by: Mykola Golub <to.my.trociny@gmail.com>
Signed-off-by: Mykola Golub <to.my.trociny@gmail.com>
@dillaman dillaman left a comment

lgtm

@dillaman dillaman merged commit baa65c2 into ceph:master Nov 6, 2017
@trociny trociny deleted the wip-copy-deep branch November 6, 2017 17:23