librbd: support migrating images with minimal downtime #15831
Conversation
@dillaman
@@ -31,6 +31,13 @@ class FlattenRequest : public Request<ImageCtxT>
   void send_op() override;
   bool should_complete(int r) override;

   int filter_return_code(int r) const override {
     if (r == -ENOENT && m_ignore_enoent) {
Should this be limited to `m_state == STATE_UPDATE_HEADER`, and the similar handling be removed from `should_complete`?
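If the check were scoped that way, it might look like the sketch below. This is a minimal stand-in, not the actual librbd code: the state enum and member names are assumptions loosely modeled on `FlattenRequest`.

```cpp
#include <cerrno>

// Illustrative stand-ins; the real FlattenRequest tracks more state.
enum State { STATE_FLATTEN, STATE_UPDATE_HEADER };

struct FlattenRequestSketch {
  State m_state = STATE_FLATTEN;
  bool m_ignore_enoent = true;

  // Only swallow -ENOENT while updating the header, so earlier states
  // still surface the error instead of special-casing it elsewhere.
  int filter_return_code(int r) const {
    if (m_state == STATE_UPDATE_HEADER && r == -ENOENT && m_ignore_enoent) {
      return 0;
    }
    return r;
  }
};
```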
src/librbd/api/Migrate.cc
Outdated
src_name = ".migrate." + m_image_ctx->name; // XXXMG
src_header_oid = util::old_header_name(src_name);

r = tmap_rm(m_src_io_ctx, m_image_ctx->name);
Atomically add the entry for the new `src_name` and remove the old name under a single op, perhaps?
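In the v1 image directory that would mean batching both tmap mutations into a single update. The sketch below only models the desired atomicity, with a plain map standing in for the rbd_directory object (the function name is illustrative); in librbd the same effect would come from encoding both a set for `src_name` and a removal of the old name into one `tmap_update` command bufferlist so the OSD applies them as a single op.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// A std::map standing in for the rbd_directory tmap object.
using Directory = std::map<std::string, std::string>;

// Rename an image's directory entry in one step: no window where the
// image exists under neither (or both) names.
void rename_entry_atomically(Directory &dir, const std::string &old_name,
                             const std::string &src_name) {
  auto it = dir.find(old_name);
  if (it == dir.end()) {
    throw std::runtime_error("image not in directory");
  }
  std::string value = it->second;
  dir.erase(it);
  dir[src_name] = value;
}
```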
src/librbd/api/Migrate.cc
Outdated
}

ldout(m_cct, 20) << "moving " << m_image_ctx->header_oid << " -> "
Perhaps an alternative would be to just bump the v1 header version `RBD_HEADER_VERSION` and add support for the redirect spec directly, so all this logic of "renaming" the header can be avoided. It also opens the door for krbd to add a small tweak to detect the bumped header w/ a redirect pointer to a v2 image.
@dillaman Back to this PR. I am not sure I got your idea about using RBD_HEADER_VERSION.
If I understand you right, new clients could detect in some way (via a redirect header spec) that the image is being migrated and could be redirected to the new header. But what about older clients? Would bumping the on-disk header version forbid older clients from opening this image? I have failed to find the relevant code.
Hmm -- you are right that nothing appears to care about the header version.
src/librbd/api/Migrate.cc
Outdated
src_header_oid = m_image_ctx->header_oid;

r = trash_move(m_src_io_ctx, RBD_TRASH_IMAGE_SOURCE_USER, m_image_ctx->name,
               0 /* XXXMG: use some reasonable expiration time? */);
We will need some protection to prevent the image from being deleted from the trash while a migration is in-progress. So long as that protection exists, there is no real need for a time limit. Once the operator is comfortable that all old references to the VM have been removed (assuming it's a pool migration), it's safe to delete the image once its migration completes.
src/librbd/api/Migrate.cc
Outdated
cls::rbd::MigrateSpec migrate_spec =
  {cls::rbd::MIGRATE_TYPE_SRC, m_dst_io_ctx.get_id(), m_dstname, image_id};
r = cls_client::migrate_set(&m_src_io_ctx, src_header_oid, migrate_spec);
You can use a two-phase approach: set a "migration pending" state on the image before doing anything else, then commit the migration record here.
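The ordering matters mainly for crash recovery. A toy sketch of that two-phase sequencing follows; the state names and functions are hypothetical, not the PR's actual on-disk states (which the PR stores via `cls_client::migrate_set`).

```cpp
#include <stdexcept>

// Hypothetical migration states for the source image.
enum class MigrateState { NONE, PREPARING, COMMITTED };

struct SrcImage {
  MigrateState state = MigrateState::NONE;
};

// Phase 1: flag "migration pending" before mutating anything else.
void prepare(SrcImage &img) {
  if (img.state != MigrateState::NONE) {
    throw std::runtime_error("migration already in progress");
  }
  img.state = MigrateState::PREPARING;
}

// Phase 2: commit the migration record. A crash between the phases
// leaves the image in PREPARING, which can be safely rolled back.
void commit(SrcImage &img) {
  if (img.state != MigrateState::PREPARING) {
    throw std::runtime_error("migration was not prepared");
  }
  img.state = MigrateState::COMMITTED;
}
```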
src/librbd/api/Migrate.cc
Outdated
migrate_spec =
  {cls::rbd::MIGRATE_TYPE_DST, m_src_io_ctx.get_id(), src_name, src_id};
r = cls_client::migrate_set(&m_dst_io_ctx, util::header_name(image_id),
Probably should set this on the destination image before linking the source to the destination.
src/librbd/image/RefreshRequest.cc
Outdated
  }
}

if (*result == -ENOENT) {
...or `-EOPNOTSUPP`
src/librbd/image/RefreshRequest.cc
Outdated
parent_md->spec.image_id = m_migrate_spec.image_id;
parent_md->spec.snap_id = CEPH_NOSNAP;
parent_md->spec.migrate_source = true;
parent_md->overlap = m_size;
You would need to track the size to prevent edge-case oddities just like the clone parent overlap size is stored.
Do you mean I need to store the image size in migrate_spec and use it here?
I just mean that you would have to store the overlap on disk instead of just assigning it the image size upon refresh.
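The analogy is with clone parent overlap: the persisted overlap bounds how far reads fall through to the source, regardless of later resizes. A minimal sketch of that clamping (the struct and field names here are assumptions, not the PR's actual on-disk format):

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical persisted spec: the overlap is written once when migration
// starts, instead of being recomputed from the image size on refresh.
struct MigrateSpecSketch {
  uint64_t overlap = 0;  // bytes of the image still backed by the source
};

// Reads past min(stored overlap, current size) must not fall through to
// the migration source, mirroring how clone parent overlap is applied.
inline uint64_t effective_overlap(const MigrateSpecSketch &spec,
                                  uint64_t current_size) {
  return std::min(spec.overlap, current_size);
}
```

If the destination is shrunk mid-migration, the clamp keeps the overlap consistent; simply assigning the image size on refresh would silently re-expand it.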
src/librbd/api/Migrate.cc
Outdated
{
  RWLock::RLocker snap_locker(ictx->snap_lock);
  if (ictx->snaps.size() > 0) {
    // XXXMG: or rather check for children?
Hmm -- snapshots and clones are definitely an issue. I think this becomes limited value if we cannot transparently migrate at least snapshots. That logic can at least be copied from rbd-mirror image sync, but that implies the need for a special copy-up logic that can deep-sync snapshot state.
In fact, thinking about such a scenario -- as complicated as that seems -- might actually be a win-win. If we could have the "local" mirrored image be registered as a sync sink to the "remote" primary mirrored image, the actual block sync is really just basically a flatten, and we don't need to pause journal replay for a full bootstrap sync since the sync "copy-up" logic would ensure the underlying "remote" image block was properly deeply copied to the non-primary before writing the journal event to the block.
Probably the best approach would be a temporary RW feature bit that is set on the original/destination images while migration is in-progress.
Now that this design is starting to get fleshed out -- I am starting to worry about its scope. It's definitely going to be a hard problem with lots of edge conditions. Maybe we should start smaller -- like a "copy-deep" function that can do the deep snapshot sync/copy-up in a generic fashion?
@dillaman Thanks, I can start from a "copy-deep" function. It looks like the implementation could be:
For (1-3) care should be taken to prevent open by another client until (3) is complete. Not sure how (5) should be implemented though. Now, instead of just reading from the parent, I would need to iterate over the snap list and do copy-up for every snapshot. It looks troublesome to just reuse rbd_mirror/image_sync/ObjectCopyRequest because the object size may differ from the parent's in the general case. I imagine DeepCopyupRequest could be something like this: list snaps, and for every snap call SetSnapRequest + CopyupRequest. But I suppose that would require blocking concurrent operations and would probably be far from optimal?
@trociny I wouldn't think we would need to track the copy source on-disk, since if there is a crash you can delete the image and start again (just like w/ the current "copy" method). I think the copy-up / deep object sync logic could still be shared and made generic enough to work. Yes, the current object sync works on an object-by-object basis, but it could just as easily map the destination object's image extents back to the backing source image's object(s), read the snap sets, read (if necessary) from the source image, and commit the changes to the destination. There shouldn't be a need to switch the entire destination image over to a particular snapshot nor to block requests.
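The extent mapping described above can be sketched with plain arithmetic: given a destination object, find the range of source objects backing the same image byte range when the two images use different object sizes. This assumes simple power-of-two object layouts with no fancy striping; names are illustrative, not librbd's Striper API.

```cpp
#include <cstdint>
#include <utility>

struct Layout {
  uint64_t object_size;  // bytes per object, e.g. 1 << order
};

// Image byte range [begin, end) covered by object `objno` in layout `l`.
inline std::pair<uint64_t, uint64_t> object_extent(const Layout &l,
                                                   uint64_t objno) {
  return {objno * l.object_size, (objno + 1) * l.object_size};
}

// Inclusive range [first, last] of source objects overlapping the image
// extent of destination object `dst_objno`.
inline std::pair<uint64_t, uint64_t> source_objects(const Layout &src,
                                                    const Layout &dst,
                                                    uint64_t dst_objno) {
  auto [off, end] = object_extent(dst, dst_objno);
  return {off / src.object_size, (end - 1) / src.object_size};
}
```

For example, with 4 MiB source objects and 1 MiB destination objects, each destination object maps back to exactly one source object; with the sizes reversed, one destination object spans four source objects.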
@dillaman I will try your suggestion. Thanks. I wanted to add a "copy_from" header on the destination so it could be safely opened by other clients while flattening is still in progress. This would be used by migration (instead of adding a migrate header) and might be useful for other applications. But yes, it does not look necessary, and I can do without it.
Signed-off-by: Mykola Golub <mgolub@suse.com>
…NOSNAP (it is possible now when the parent is a migration source) Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Signed-off-by: Jason Dillaman <dillaman@redhat.com> Signed-off-by: Mykola Golub <mgolub@suse.com>
Use the pointer in copyup and migrate requests, when deep copying an object. Signed-off-by: Mykola Golub <mgolub@suse.com>
if copyup request was appended with non empty writes when the deep copy was in flight. Signed-off-by: Mykola Golub <mgolub@suse.com>
@dillaman thanks, updated
@trociny can you rebase so that I can merge?
librbd: support migrating images with minimal downtime Reviewed-by: Jason Dillaman <dillaman@redhat.com>
Manually merged -- closing
This implements the first stage of the feature to transparently support migrating images with minimal/zero downtime [1].
A new API/CLI function "migrate" has been added, which runs migration by creating a new image header, setting the old image as its parent, and starting the flattening process. Currently it requires that the image being migrated is not opened RW (watched) at the moment the "migrate" function is called. After migration has started, other clients may begin using the image without waiting for migration to complete, which allows migrating images with minimal downtime.
The next stage is to make migration transparent (zero downtime) for new clients. It will require:
[1] http://tracker.ceph.com/issues/18430