
librbd: support migrating images with minimal downtime #15831

Closed
wants to merge 35 commits

Conversation


@trociny trociny commented Jun 22, 2017

This implements the first stage for the feature to transparently support migrating images with minimal/zero downtime [1]

A new API/CLI function "migrate" has been added, which runs the migration by creating a new image header, setting the old image as its parent, and starting the flatten process. Currently it requires that the image being migrated is not open read-write (watched) at the moment the "migrate" function is called. After the migration has started, other clients may start using the image without waiting for the migration to complete, which allows migrating images with minimal downtime.

The next stage is to make migration transparent (zero downtime) for new clients. It will require:

[1] http://tracker.ceph.com/issues/18430

trociny commented Jun 22, 2017

@dillaman
This is still WIP, I would like to discuss the approach and these questions (see also comments in the code):

  • New clients will check whether the migrate header is set and will fail to open the image in read-write mode, but older clients will still be able to open it. I suppose we can make them fail on the server side, e.g. in get_mutable_metadata, but it looks like there is no standard way to version cls functions. It could be worked around by extending cls_client::get_parent to require an additional argument (a "supports migration" flag): if the argument is not provided, the client is assumed to be old and the server returns -EPERM when the migration header is set. Do you see a better solution?
  • I am not sure about the best order of operations. E.g. it would be nice to set the migrate header as the first step, so that other clients fail to open the image, but this complicates migrate, because trash_move(), which 'migrate' uses, would then fail too.
  • For an old-format image the first step is to rename the header to ".migrate.${name}", which is still accessible to the user and may theoretically fail. Can this be improved?
  • Rollback is not implemented yet. If migrate fails because the operation is not supported by the server (and possibly for some other errors), rollback will be performed automatically before migrate returns. If migrate fails for other reasons, I see two approaches:
    1. after fixing the issue, the user runs migrate again (similar to how rbd remove currently works);
    2. a migrate_rollback command is used to roll back an interrupted migrate (fix the headers and apply changes from the child image if any exist).
      Which approach should we use? Maybe both?

@@ -31,6 +31,13 @@ class FlattenRequest : public Request<ImageCtxT>
void send_op() override;
bool should_complete(int r) override;

int filter_return_code(int r) const override {
if (r == -ENOENT && m_ignore_enoent) {


Should this be limited to m_state == STATE_UPDATE_HEADER and the similar handling be removed from should_complete?

src_name = ".migrate." + m_image_ctx->name; // XXXMG
src_header_oid = util::old_header_name(src_name);

r = tmap_rm(m_src_io_ctx, m_image_ctx->name);


Atomically add the entry for the new `src_name` and remove the old name under a single op, perhaps?

}


ldout(m_cct, 20) << "moving " << m_image_ctx->header_oid << " -> "


Perhaps an alternative would be to just bump the v1 header version RBD_HEADER_VERSION and add support for the redirect spec directly so all this logic of "renaming" the header can be avoided. It also opens the door for krbd to add a small tweak to detect the bumped header w/ a redirect pointer to a v2 image.

Contributor Author


@dillaman Back to this PR. I am not sure I got your idea about using RBD_HEADER_VERSION.

If I understand you correctly, new clients could detect in some way (via a redirect header spec) that the image is being migrated and could be redirected to the new header. But what about older clients? Would bumping the on-disk header version prevent older clients from opening the image? I have failed to find the relevant code.


Hmm -- you are right that nothing appears to care about the header version.

src_header_oid = m_image_ctx->header_oid;

r = trash_move(m_src_io_ctx, RBD_TRASH_IMAGE_SOURCE_USER, m_image_ctx->name,
0 /* XXXMG: use some reasonable expiration time? */);


We will need some protection to prevent the image from being deleted from the trash while a migration is in-progress. So long as that protection exists, there is no real need for a time limit. Once the operator is comfortable that all old references to the VM have been removed (assuming it's a pool migration), it's safe to delete the image once its migration completes.


cls::rbd::MigrateSpec migrate_spec =
{cls::rbd::MIGRATE_TYPE_SRC, m_dst_io_ctx.get_id(), m_dstname, image_id};
r = cls_client::migrate_set(&m_src_io_ctx, src_header_oid, migrate_spec);


You can use a 2 phase approach where you set a "migration pending" state on the image before doing anything else, then commit the migration record here.


migrate_spec =
{cls::rbd::MIGRATE_TYPE_DST, m_src_io_ctx.get_id(), src_name, src_id};
r = cls_client::migrate_set(&m_dst_io_ctx, util::header_name(image_id),


Probably should set this on the dest image before linking the source to the destination

}
}

if (*result == -ENOENT) {


... or -EOPNOTSUPP

parent_md->spec.image_id = m_migrate_spec.image_id;
parent_md->spec.snap_id = CEPH_NOSNAP;
parent_md->spec.migrate_source = true;
parent_md->overlap = m_size;


You would need to track the size to prevent edge-case oddities just like the clone parent overlap size is stored.

Contributor Author


Do you mean I need to store image size into migrate_spec and use it here?


I just mean that you would have to store the overlap on disk instead of just assigning it the image size upon refresh.

{
RWLock::RLocker snap_locker(ictx->snap_lock);
if (ictx->snaps.size() > 0) {
// XXXMG: or rather check for children?


Hmm -- snapshots and clones are definitely an issue. I think this becomes limited value if we cannot transparently migrate at least snapshots. That logic can at least be copied from rbd-mirror image sync, but that implies the need for a special copy-up logic that can deep-sync snapshot state.

In fact, thinking about such a scenario -- as complicated as that seems -- might actually be a win-win. If we could have the "local" mirrored image be registered as a sync sink to the "remote" primary mirrored image, the actual block sync is really just basically a flatten, and we don't need to pause journal replay for a full bootstrap sync since the sync "copy-up" logic would ensure the underlying "remote" image block was properly deeply copied to the non-primary before writing the journal event to the block.


dillaman commented Jun 28, 2017

  • New clients will check whether the migrate header is set and will fail to open the image in read-write mode, but older clients will still be able to open it. I suppose we can make them fail on the server side, e.g. in get_mutable_metadata, but it looks like there is no standard way to version cls functions. It could be worked around by extending cls_client::get_parent to require an additional argument (a "supports migration" flag): if the argument is not provided, the client is assumed to be old and the server returns -EPERM when the migration header is set. Do you see a better solution?

Probably the best approach would be a temporary RW feature bit that is set on the original/destination images while migration is in-progress.

  • I am not sure about the best order of operations. E.g. it would be nice to set the migrate header as the first step, so that other clients fail to open the image, but this complicates migrate, because trash_move(), which 'migrate' uses, would then fail too.

Could add a param to trash_move to override the names

  • For an old-format image the first step is to rename the header to ".migrate.${name}", which is still accessible to the user and may theoretically fail. Can this be improved?

See inline comment

  • Rollback is not implemented yet. If migrate fails because the operation is not supported by the server (and possibly for some other errors), rollback will be performed automatically before migrate returns. If migrate fails for other reasons, I see two approaches:
    1. after fixing the issue, the user runs migrate again (similar to how rbd remove currently works);
    2. a migrate_rollback command is used to roll back an interrupted migrate (fix the headers and apply changes from the child image if any exist).
      Which approach should we use? Maybe both?

I'd normally say "run it again" -- but perhaps a migrate abort command might be helpful (I'd hate to say rollback since that might imply data changes from the destination image are rolled back to the original source). Since the goal was zero downtime, if there is an issue, it would be great to get an image working again ASAP.

@dillaman

Now that this design is starting to get fleshed out -- I am starting to worry about its scope. It's definitely going to be a hard problem with lots of edge conditions. Maybe we should start smaller -- like a "copy-deep" function that can do the deep snapshot sync/copy-up in a generic fashion?


trociny commented Jun 29, 2017

@dillaman Thanks, I can start from "copy-deep" function.

It looks like the implementation could be:

  1. create a new image header;
  2. create (copy) snapshots metadata;
  3. add "copy_from" header with source image spec;
  4. make RefreshRequest to look for copy_from header and use it as a parent;
  5. modify CopyupRequest (create new DeepCopyupRequest?) to do the deep snapshot sync/copy-up

For (1-3), care should be taken to prevent another client from opening the image until (3) is complete.
(2) can be similar to rbd_mirror/image_sync/SnapshotCopyRequest.
(4) is just what is currently implemented in this PR for migrate header + your comments.

I am not sure how (5) should be implemented, though. Instead of just reading from the parent, I would need to iterate over the snap list and do a copy-up for every snapshot. It looks troublesome to simply reuse rbd_mirror/image_sync/ObjectCopyRequest because, in the general case, the parent's object size may differ.

I imagine DeepCopyupRequest could work something like this: list the snaps, and for every snap call SetSnapRequest + CopyupRequest. But I suppose that would require blocking concurrent operations and would probably be far from optimal?

@dillaman

@trociny I wouldn't think we would need to track the copy source on-disk since if there is a crash, delete the image and start again (just like w/ the current "copy" method). I think the copy-up / deep object sync logic could still be shared and made generic enough to work. Yes, the current object sync works on an object by object basis, but it could just as easily map the destination object's image extents back to the backing source image's object(s), read the snap sets, read (if necessary) from the source image, and commit the changes to the destination. There shouldn't be a need to switch the entire destination image over to a particular snapshot nor block requests.


trociny commented Jun 29, 2017

@dillaman I will try your suggestion. Thanks.

I wanted to add a "copy_from" header on the destination so that it could be safely opened by other clients while flattening is still in progress. This would be used by migration (instead of adding a migrate header) and might be useful for other applications. But yes, it does not look necessary, and I can do without it.

@trociny trociny force-pushed the wip-18430-1 branch 2 times, most recently from 6948db0 to 6bfd916 Compare October 14, 2017 20:00
@trociny trociny force-pushed the wip-18430-1 branch 3 times, most recently from 757bfed to ab8a19d Compare November 9, 2017 17:42
@trociny trociny force-pushed the wip-18430-1 branch 4 times, most recently from 8cad00a to 5254517 Compare November 23, 2017 19:44
@trociny trociny force-pushed the wip-18430-1 branch 10 times, most recently from 30825fb to f2aeef6 Compare November 29, 2017 20:49
trociny and others added 23 commits July 26, 2018 08:34
Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
…NOSNAP

(it is possible now when the parent is a migration source)

Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
Use the pointer in copyup and migrate requests, when deep copying an
object.

Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
if copyup request was appended with non empty writes when the deep
copy was in flight.

Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>

trociny commented Jul 26, 2018

@dillaman thanks, updated


@dillaman dillaman left a comment


@trociny can you rebase so that I can merge?

dillaman pushed a commit that referenced this pull request Aug 14, 2018
librbd: support migrating images with minimal downtime

Reviewed-by: Jason Dillaman <dillaman@redhat.com>
@dillaman

Manually merged -- closing
