New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
librbd: generalized deep copy function #16238
Conversation
@dillaman I created this PR at this stage just to make discussion easier. The current version still needs cleanup and unit tests, and works only if source and destination image object sizes are the same. Now I am trying to generalize it to work when src and dst object sizes differ. The approach I am trying is to substitute one to one src to dst object mapping to src object extent to dst object extent mapping, and modify ObjectCopyRequest to be actually ObjectExtentCopyRequest, which is called to copy an src object N extent [a, b] to dst object M extent [c, d]. Is this approach that you were thinking about? It looks like the most complicated part here is to modify ObjectCopyRequest::compute_diffs. I think TRUNC and REMOVE operations that made sense for object to object mapping should be replaced by Also, not sure what to do with m_snap_object_states and update_object_map... |
@dillaman I see now that my "copy extent by extent approach" is actually wrong (due to the problems I described above). It looks like what I need is: for every dst object
(actually not this looks to me like what you initially suggested) |
src/librbd/image/CopyRequest.cc
Outdated
return; | ||
} | ||
|
||
// rollback the object map (copy snapshot object map to HEAD) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@trociny How good It would be start comment from capital letter?
// Rollback the object map (copy snapshot object map to HEAD)
|
||
ldout(m_cct, 20) << dendl; | ||
|
||
// Change the image size on disk so that the snapshot picks up |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@trociny I believe multi-line comment should be inside /* */
} | ||
|
||
{ | ||
// adjust in-memory image size now that it's updated on disk |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@trociny I believe It would be even better if we start comment with capital letter something as:
// Adjust in-memory image size now that it's updated on disk
ldout(m_cct, 20) << "object_map_oid=" << object_map_oid << ", " | ||
<< "object_count=" << object_count << dendl; | ||
|
||
// initialize an empty object map of the correct size (object sync |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@trociny multiline comment inside /* */
@dillaman Here is updated copy_deep implementation, which supports different source and destination image object sizes. It still needs unit tests though. Could you please look at the current version, so I could be sure I go in the right direction? Also, I would like to discuss the scope of this PR. I have added |
@dillaman I have (temporary) added a commit that makes rbd-mirror to use generalized copy deep. I think it can be pushed as a separated PR later. It is here just to show how I plan to use copy deep. |
src/librbd/CMakeLists.txt
Outdated
@@ -34,13 +34,18 @@ set(librbd_internal_srcs | |||
exclusive_lock/StandardPolicy.cc | |||
image/CloneRequest.cc | |||
image/CloseRequest.cc | |||
image/CopyRequest.cc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I would probably move this up to the root and rename to DeepCopyRequest
src/librbd/CMakeLists.txt
Outdated
image/CreateRequest.cc | ||
image/OpenRequest.cc | ||
image/RefreshParentRequest.cc | ||
image/RefreshRequest.cc | ||
image/RemoveRequest.cc | ||
image/SetFlagsRequest.cc | ||
image/SetSnapRequest.cc | ||
image_copy/ImageCopyRequest.cc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: subdirectory should be deep_copy
} | ||
} | ||
|
||
// Recalculate for destination |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why would the destination size be different on a copy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not the destination image size but total number of objects (m_end_object_no
) may be different if src and dst object sizes are different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, but the way that ObjectCopyRequest
is currently designed, it assumes that the source and destination have matching layouts (order and striping v2). If we want to support overriding those settings on the destination, which I think is important, that class needs to be redesigned so that you loop over all destination objects and then the ObjectCopyRequest
state machine loads diffs for all source objects that overlap w/ the destination object's layout, computes the necessary ops over the overlap, and does the single destination write per snapshot.
Given that, I think you would drop this logic and change the loop above to iterate over the destination image instead of the source image.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose it is exactly how it works now? I.e. here we calculate number of destination objects and then for every destination object call ObjectCopyRequest
. In my modified implementation it expects destination object number as a param, and then it maps dst object to src objects, loads diffs, etc...
But I forgot about striping v2 feature. I need to refresh memory about how it works, but right now I think the code will not work correctly in general case if striping is enabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed that you adjusted it to handle that case. All object and offset calculation should use the Striper logic instead of attempting to re-implement it all over the place. You would use the striper to calculate the image extents covered by the destination object and then for each image extent you would map that back to source object extents.
#include "ObjectCopyRequest.h" | ||
#include "common/errno.h" | ||
#include "librbd/Utils.h" | ||
#include "tools/rbd_mirror/ProgressContext.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: yank
#include "osdc/Striper.h" | ||
|
||
#define dout_context g_ceph_context | ||
#define dout_subsys ceph_subsys_rbd_mirror |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: adjust
src/tools/rbd_mirror/ImageSync.cc
Outdated
} | ||
|
||
template <typename I> | ||
void ImageSync<I>::handle_refresh_object_map(int r) { | ||
dout(20) << dendl; | ||
void ImageSync<I>::handle_unset_sync_point_snap_contex(int r) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: typo
src/tools/rbd_mirror/ImageSync.h
Outdated
* | | ||
* v | ||
* COPY_IMAGE . . . . . . . . . . . . . . | ||
* | . | ||
* v . | ||
* COPY_OBJECT_MAP (skip if object . | ||
* | map disabled) . | ||
* UNSET_SYNC_POINT_SNAP_CONTEX . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: typo
src/tools/rbd_mirror/ImageSync.h
Outdated
@@ -79,18 +76,15 @@ class ImageSync : public BaseRequest { | |||
* CREATE_SYNC_POINT (skip if already exists and | |||
* | not disconnected) | |||
* v | |||
* COPY_SNAPSHOTS | |||
* SET_SYNC_POINT_SNAP_CONTEX |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: typo
.WillOnce(Invoke([this, &mock_snapshot_copy_request, r]() { | ||
m_threads->work_queue->queue(mock_snapshot_copy_request.on_finish, r); | ||
})); | ||
void expect_set_sync_point_snap_contex( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: typo
.WillOnce(Invoke([this](Context *ctx) { | ||
m_threads->work_queue->queue(ctx, 0); | ||
})); | ||
void expect_unset_sync_point_snap_contex( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: typo
@trociny Looks really good to me -- just nits basically. Might be even clearer history-wise if you just |
} | ||
|
||
// prepare the object map state | ||
// XXXMG: how to support OBJECT_EXISTS_CLEAN? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dillaman What do you think about this? The commented code is from the original image_sync. Now, thinking about this more it looks to me I can safely use the same, i.e. just uncomment this code and remove m_snap_object_states[end_dst_snap_id] = OBJECT_EXISTS
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems OK to me
ec00195
to
eb8bd12
Compare
@dillaman updated according to your comments + unit tests added. Though right now I think I unnecessarily limited I am planning to generalize |
eb8bd12
to
5709ae6
Compare
@dillaman updated. The commit that makes rbd-mirror to use generalized copy deep is still here to show how deep copy is used. I can push it as a separated PR if it makes reviewing and merging easier. |
src/librbd/DeepCopyRequest.cc
Outdated
return -EINVAL; | ||
} | ||
|
||
if (m_src_image_ctx->snap_info.find(m_snap_id_end) == |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the rbd CLI's deep-copy, it might be nice to support copying to the HEAD revision if desired (i.e. if the image is not being actively modified, no need to create a snapshot).
src/librbd/DeepCopyRequest.cc
Outdated
} | ||
|
||
template <typename I> | ||
void DeepCopyRequest<I>::send_remove_copy_snapshot() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this state machine didn't create the snapshot, perhaps it shouldn't be responsible for deleting it -- especially if the goal is to have rbd-mirror use its own snapshot namespace eventually.
src/librbd/DeepCopyRequest.cc
Outdated
void DeepCopyRequest<I>::send_copy_object_map() { | ||
m_dst_image_ctx->owner_lock.get_read(); | ||
m_dst_image_ctx->snap_lock.get_read(); | ||
if (!m_dst_image_ctx->test_features(RBD_FEATURE_OBJECT_MAP, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... also skip if copying from the HEAD revision since the object copy should have updated the HEAD object map
src/librbd/DeepCopyRequest.cc
Outdated
#include "librbd/deep_copy/SnapshotCopyRequest.h" | ||
#include "librbd/operation/SnapshotRemoveRequest.h" | ||
|
||
#define dout_context g_ceph_context |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: yank
#include "librbd/Utils.h" | ||
#include "osdc/Striper.h" | ||
|
||
#define dout_context g_ceph_context |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: yank
m_snap_object_sizes = {}; | ||
m_src_object_offsets = {}; | ||
|
||
std::vector<std::pair<uint64_t, uint64_t>> file_extents; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: file
-> image
|
||
template <typename I> | ||
void ObjectCopyRequest<I>::send_list_snaps() { | ||
// starting copy from highest src object number is important for truncate |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be able to handle a truncate of any source object -- and w/ striping, it's possible that the "end" image extent of the destination object is not align with the highest source object number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a destination object number x
is mapped to src object numbers a, b, c
(due to different object sizes) I assume that it will be always a < b < c
for any striping. Now, if reading src objects I see that c
does not exist, and b
is truncated, I can truncate the destination object up to b
truncated length. If b
is truncated but c
exists I know that I need to write a hole.
For this reason I start listing src objects from the highest one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If fancy striping is used, I am not sure that logic is correct:
src:
- object size 2
- stripe unit = 1
- stripe count = 3
dst:
- object size 4
- no fancy striping
The mapping would be: object 1@0~1, object 2@0~1, object 3@0~1, object 1@1~1
. If object 3 doesn't exist in this case, that's a hole and not a truncate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dillaman It is good to know. Thanks!
bufferlist out_bl; | ||
}; | ||
|
||
typedef std::map<uint64_t, std::vector<std::pair<uint64_t, size_t>>> ObjectOffsets; // src_object_id -> list of (object_off, len) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: ObjectExtents
is the terminology used elsewhere
auto len = it.second; | ||
|
||
interval_set<uint64_t> copy_interval; | ||
copy_interval.insert(src_object_off, src_object_off + len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Normal interval_set
insert logic is value and length not start and end values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about the complexity of this some more, I wonder if it would make more sense to refactor the CopyOp
computation logic and separate out the read and write ops into two different collections. The reads would be ordered by write/read snapshots and source object and the writes could be ordered by write snapshot.
We would change the flow a little so that for each source object, we first list its diff then perform the read. This allows us to avoid repeating all source object reads should a retry be required. Once all the reads have been performed across all source objects, we can perform the write stage.
When computing the diffs, instead of creating remove/zero/truncate ops, we could instead just track an interval set of zeroed locations for the destination object at the current write snapshot. Then as part of the write step, we would generate zero, truncate or remove ops I think pretty easily.
void send_update_object_map(); | ||
void handle_update_object_map(int r); | ||
|
||
Context *start_dst_op(RWLock &owner_lock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: start_lock_op
or start_destination_op
@dillaman updated. Still open questions:
|
@trociny The |
32ad788
to
89294dd
Compare
@dillaman updated to deal with As for |
@trociny Sorry -- I was saying that I think the metadata copy should be moved to deep-copy as well. I was just trying to say that you could move it anywhere so long as it executes at the end of the deep-copy. |
@dillaman Thanks. I will move |
89294dd
to
e47f813
Compare
@dillaman updated |
e47f813
to
d62473f
Compare
@dillaman I have updated fsx to support deep-copy. I have not tested it well yet, and I still need to enable it in qa. Right now I have a question though. I had to link fsx with internal libraries. Might we want to add deep_copy to API instead? |
@trociny Yeah -- add it to the API since I can see this being added to the rbd CLI as well in a future PR. |
b40ad47
to
1a98ef3
Compare
submodules for project are unmodified |
all commits in this PR are signed |
OK - docs built |
@dillaman I have added deep_copy to API and enabled fsx qa tests. And observing fsx test failures on teuthology now [1], which I still need to track down (failed to reproduce locally so far, the failed fsx command successfully completes on my "vstart" env). Meantime I have questions about deep_copy API:
[1] http://pulpito.ceph.com/trociny-2017-11-01_10:45:27-rbd-wip-mgolub-testing-distro-basic-smithi/ |
make check succeeded |
1 similar comment
make check succeeded |
src/include/rbd/librbd.h
Outdated
CEPH_RBD_API int rbd_deep_copy(rbd_image_t src, rados_ioctx_t dest_io_ctx, | ||
const char *destname, rbd_image_options_t dest_opts); | ||
CEPH_RBD_API int rbd_deep_copy_with_progress(rbd_image_t image, | ||
rados_ioctx_t dest_p, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: dest_io_ctx
?
1a98ef3
to
8750cf4
Compare
for (auto &p : src_object_extents) { | ||
for (auto &s : p.second) { | ||
m_src_objects.insert(s.objectno); | ||
m_src_object_extents.push_back({s.objectno, s.offset, s.length}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dillaman I think I have found an issue here.
I need m_src_object_extents
list to be ordered by a destination object offset (so I could sequentially write extents to the destination object; alternatively I would need to track src_offset->dst_offset). But I did not realize Striper::file_to_extents
returned a map ordered by object and the order is lost. So I need to sort extents returned in src_object_extents
.
Right now I don't see a good way. It looks like using Striper::extent_to_file
+ Striper::file_to_extents
like above I can't map dst_object_offset
-> src_object, src_object_offset
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, it looks like I just need to use a different version of Striper::file_to_extents
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the version that returns a list, uses assimilate_extents
which does the same that my code above, so it returns the list in wrong order (
It looks I would need to extend Striper
to provide file_to_extents
version I need?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dillaman I fixed this by introducing src_to_dst_object_offset()
(see the fixup commit 48029c2). Do you think it could be a solution or do we need something more efficient (e.g. update Striper
to return results we need)?
After this change fsx
still fails but in different place. Need to track this further.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dillaman Another problem with compute_src_object_extents()
found: I need to split src object extents, if the destination stripe_unit is smaller than the source stripe_unit. The fixup commit 9024db9 contains the fix for this issue too. Now it is much better: I have not been able to make fsx fail so far!
I will squash the fixup after your review.
f1ff5fe
to
9024db9
Compare
teuthology fsx tests pass now: http://pulpito.ceph.com/trociny-2017-11-04_09:03:20-rbd-wip-mgolub-testing-distro-basic-ovh/ |
assert(s.length >= stripe_unit); | ||
auto dst_object_offset = src_to_dst_object_offset(s.objectno, s.offset); | ||
m_src_object_extents[dst_object_offset] = {s.objectno, s.offset, | ||
stripe_unit}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would happen if the dst_object_offset
is in the middle of a dest stripe? For example, if the source extent was 2~2
(stripe of 2) and the destination extext was 0~3
(stripe of 3), the first byte of the source extent would map to the destination extent 2~1
but the last byte of the min stripe would go somewhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dillaman I am not sure I see exactly what you mean, but taking the condition that is used for stripe_unit
on image creation:
(1 << order) % stripe_unit == 0 && stripe_unit <= (1 << order)
which I think can be rewritten as:
stripe_unit == (1 << x), where x <= order
which means that a situation like you described is not possible?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK -- fair enough if odd stripe units are prevented. We have plenty of "fsx" testing time to catch any potential regressions. Go ahead and squash the fixup commit (or just prefix it w/ "librbd") for merging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dillaman thanks. Rebased.
9024db9
to
ebdeaa0
Compare
} // namespace librbd | ||
|
||
// template definitions | ||
template class librbd::DeepCopyRequest<librbd::MockTestImageCtx>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: no need to explicitly instantiate -- it's causing a compile warning:
In file included from /home/jdillaman/ceph/src/test/librbd/test_mock_DeepCopyRequest.cc:148:0:
/home/jdillaman/ceph/src/librbd/DeepCopyRequest.cc:64:6: warning: ‘void librbd::DeepCopyRequest::cancel() [with ImageCtxT = librbd::{anonymous}::MockTestImageCtx]’ defined but not used [-Wunused-function]
void DeepCopyRequest::cancel() {
^~~~~~~~~~~~~~~~~~
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dillaman fixed. Thanks.
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
(so it could be used for logging image options) Signed-off-by: Mykola Golub <to.my.trociny@gmail.com>
(based on rbd-mirror image sync) Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Signed-off-by: Mykola Golub <mgolub@mirantis.com>
Signed-off-by: Mykola Golub <to.my.trociny@gmail.com>
Signed-off-by: Mykola Golub <to.my.trociny@gmail.com>
Signed-off-by: Mykola Golub <to.my.trociny@gmail.com>
ebdeaa0
to
3a09801
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
(mostly taken from rbd-mirror image sync)
Signed-off-by: Mykola Golub mgolub@mirantis.com