
librbd: support migrating images with minimal downtime #15831

Closed
wants to merge 35 commits

Conversation


@trociny trociny commented Jun 22, 2017

This implements the first stage for the feature to transparently support migrating images with minimal/zero downtime [1]

A new API/CLI function "migrate" has been added, which runs the migration by creating a new image header, setting the old image as its parent, and starting the flatten process. Currently it requires that the image being migrated is not open read-write (watched) at the moment the "migrate" function is called. After the migration has started, other clients may start using the image without waiting for the migration to complete, which allows migrating images with minimal downtime.

The next stage is to make migration transparent (zero downtime) for new clients. It will require:

[1] http://tracker.ceph.com/issues/18430

trociny commented Jun 22, 2017

@dillaman
This is still WIP, I would like to discuss the approach and these questions (see also comments in the code):

  • New clients will check whether the migrate header is set and will fail to open the image in read-write mode, but older clients will still be able to open it. I suppose we can make them fail on the server side, e.g. in get_mutable_metadata, but it looks like there is no standard way to version cls functions. It could be worked around by extending cls_client::get_parent to require an additional argument (a "supports migration" flag): if the argument is not provided, the client is assumed to be old and the server returns -EPERM when the migration header is set. Do you see a better solution?
  • I am not sure about the best order of operations. E.g. it would be nice to set the migrate header as the first step, so that other clients fail to open the image, but this complicates migrate, because trash_move(), which 'migrate' uses, would then fail too.
  • For an old-format image the first step is to rename the header to ".migrate.${name}", which is still accessible to the user and may theoretically fail. Can this be improved?
  • Rollback is not implemented yet. If migrate fails because the operation is not supported by the server (and possibly for some other errors), rollback will be performed automatically before migrate returns. If migrate fails for other reasons, I see two approaches:
    1. after fixing the issue, the user runs migrate again (similar to how rbd remove currently works);
    2. a migrate_rollback command is used to roll back an interrupted migrate (fix the headers and apply changes from the child image if any exist).
      Which approach should we use? Maybe both?

@@ -31,6 +31,13 @@ class FlattenRequest : public Request<ImageCtxT>
void send_op() override;
bool should_complete(int r) override;

int filter_return_code(int r) const override {
if (r == -ENOENT && m_ignore_enoent) {


Should this be limited to m_state == STATE_UPDATE_HEADER and the similar handling be removed from should_complete?

src_name = ".migrate." + m_image_ctx->name; // XXXMG
src_header_oid = util::old_header_name(src_name);

r = tmap_rm(m_src_io_ctx, m_image_ctx->name);


Atomically add the entry for the new `src_name` and remove the old name under a single op, perhaps?

}


ldout(m_cct, 20) << "moving " << m_image_ctx->header_oid << " -> "


Perhaps an alternative would be to just bump the v1 header version RBD_HEADER_VERSION and add support for the redirect spec directly so all this logic of "renaming" the header can be avoided. It also opens the door for krbd to add a small tweak to detect the bumped header w/ a redirect pointer to a v2 image.

Contributor Author


@dillaman Back to this PR. I am not sure I got your idea about using RBD_HEADER_VERSION.

If I understand you correctly, new clients could detect in some way (via a redirect header spec) that the image is being migrated and could be redirected to the new header. But what about older clients? Would bumping the on-disk header version prevent older clients from opening the image? I have failed to find the relevant code.


Hmm -- you are right that nothing appears to care about the header version.

src_header_oid = m_image_ctx->header_oid;

r = trash_move(m_src_io_ctx, RBD_TRASH_IMAGE_SOURCE_USER, m_image_ctx->name,
0 /* XXXMG: use some reasonable expiration time? */);


We will need some protection to prevent the image from being deleted from the trash while a migration is in-progress. So long as that protection exists, there is no real need for a time limit. Once the operator is comfortable that all old references to the VM have been removed (assuming it's a pool migration), it's safe to delete the image once its migration completes.


cls::rbd::MigrateSpec migrate_spec =
{cls::rbd::MIGRATE_TYPE_SRC, m_dst_io_ctx.get_id(), m_dstname, image_id};
r = cls_client::migrate_set(&m_src_io_ctx, src_header_oid, migrate_spec);


You can use a 2 phase approach where you set a "migration pending" state on the image before doing anything else, then commit the migration record here.


migrate_spec =
{cls::rbd::MIGRATE_TYPE_DST, m_src_io_ctx.get_id(), src_name, src_id};
r = cls_client::migrate_set(&m_dst_io_ctx, util::header_name(image_id),


Probably should set this on the dest image before linking the source to the destination

}
}

if (*result == -ENOENT) {


... or -EOPNOTSUPP

parent_md->spec.image_id = m_migrate_spec.image_id;
parent_md->spec.snap_id = CEPH_NOSNAP;
parent_md->spec.migrate_source = true;
parent_md->overlap = m_size;


You would need to track the size to prevent edge-case oddities just like the clone parent overlap size is stored.

Contributor Author


Do you mean I need to store image size into migrate_spec and use it here?


I just mean that you would have to store the overlap on disk instead of just assigning it the image size upon refresh.

{
RWLock::RLocker snap_locker(ictx->snap_lock);
if (ictx->snaps.size() > 0) {
// XXXMG: or rather check for children?


Hmm -- snapshots and clones are definitely an issue. I think this becomes limited value if we cannot transparently migrate at least snapshots. That logic can at least be copied from rbd-mirror image sync, but that implies the need for a special copy-up logic that can deep-sync snapshot state.

In fact, thinking about such a scenario -- as complicated as that seems -- might actually be a win-win. If we could have the "local" mirrored image be registered as a sync sink to the "remote" primary mirrored image, the actual block sync is really just basically a flatten, and we don't need to pause journal replay for a full bootstrap sync since the sync "copy-up" logic would ensure the underlying "remote" image block was properly deeply copied to the non-primary before writing the journal event to the block.


dillaman commented Jun 28, 2017

  • New clients will check whether the migrate header is set and will fail to open the image in read-write mode, but older clients will still be able to open it. I suppose we can make them fail on the server side, e.g. in get_mutable_metadata, but it looks like there is no standard way to version cls functions. It could be worked around by extending cls_client::get_parent to require an additional argument (a "supports migration" flag): if the argument is not provided, the client is assumed to be old and the server returns -EPERM when the migration header is set. Do you see a better solution?

Probably the best approach would be a temporary RW feature bit that is set on the original/destination images while migration is in-progress.

  • I am not sure about the best order of operations. E.g. it would be nice to set the migrate header as the first step, so that other clients fail to open the image, but this complicates migrate, because trash_move(), which 'migrate' uses, would then fail too.

Could add a param to trash_move to override the names

  • For an old-format image the first step is to rename the header to ".migrate.${name}", which is still accessible to the user and may theoretically fail. Can this be improved?

See inline comment

  • Rollback is not implemented yet. If migrate fails because the operation is not supported by the server (and possibly for some other errors), rollback will be performed automatically before migrate returns. If migrate fails for other reasons, I see two approaches:
    1. after fixing the issue, the user runs migrate again (similar to how rbd remove currently works);
    2. a migrate_rollback command is used to roll back an interrupted migrate (fix the headers and apply changes from the child image if any exist).
      Which approach should we use? Maybe both?

I'd normally say "run it again" -- but perhaps a migrate abort command might be helpful (I'd hate to say rollback since that might imply data changes from the destination image are rolled back to the original source). Since the goal was zero downtime, if there is an issue, it would be great to get an image working again ASAP.

@dillaman

Now that this design is starting to get fleshed out -- I am starting to worry about its scope. It's definitely going to be a hard problem with lots of edge conditions. Maybe we should start smaller -- like a "copy-deep" function that can do the deep snapshot sync/copy-up in a generic fashion?


trociny commented Jun 29, 2017

@dillaman Thanks, I can start from "copy-deep" function.

It looks like the implementation could be:

  1. create a new image header;
  2. create (copy) snapshots metadata;
  3. add "copy_from" header with source image spec;
  4. make RefreshRequest to look for copy_from header and use it as a parent;
  5. modify CopyupRequest (create new DeepCopyupRequest?) to do the deep snapshot sync/copy-up

For (1-3), care should be taken to prevent another client from opening the image until (3) is complete.
(2) can be similar to rbd_mirror/image_sync/SnapshotCopyRequest.
(4) is just what is currently implemented in this PR for migrate header + your comments.

I am not sure how (5) should be implemented, though. Instead of just reading from the parent, I would need to iterate over the snap list and do a copy-up for every snapshot. It looks troublesome to simply reuse rbd_mirror/image_sync/ObjectCopyRequest because, in the general case, the parent's object size may differ.

I imagine DeepCopyupRequest could work something like this: list the snaps, and for every snap call SetSnapRequest + CopyupRequest. But I suppose that would require blocking concurrent operations and would probably be far from optimal?

@dillaman

@trociny I wouldn't think we would need to track the copy source on-disk since if there is a crash, delete the image and start again (just like w/ the current "copy" method). I think the copy-up / deep object sync logic could still be shared and made generic enough to work. Yes, the current object sync works on an object by object basis, but it could just as easily map the destination object's image extents back to the backing source image's object(s), read the snap sets, read (if necessary) from the source image, and commit the changes to the destination. There shouldn't be a need to switch the entire destination image over to a particular snapshot nor block requests.


trociny commented Jun 29, 2017

@dillaman I will try your suggestion. Thanks.

I wanted to add a "copy_from" header on the destination so that it could be safely opened by other clients while flattening is still in progress. This would be used by migration (instead of adding a migrate header) and might be useful for other applications. But yes, it does not look necessary, and I can do without it.

@trociny trociny force-pushed the wip-18430-1 branch 2 times, most recently from 6948db0 to 6bfd916 Compare October 14, 2017 20:00
@trociny trociny force-pushed the wip-18430-1 branch 3 times, most recently from 757bfed to ab8a19d Compare November 9, 2017 17:42
@trociny trociny force-pushed the wip-18430-1 branch 4 times, most recently from 8cad00a to 5254517 Compare November 23, 2017 19:44
@trociny trociny force-pushed the wip-18430-1 branch 10 times, most recently from 30825fb to f2aeef6 Compare November 29, 2017 20:49
trociny and others added 23 commits July 26, 2018 08:34
Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
…NOSNAP

(it is possible now when the parent is a migration source)

Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
Use the pointer in copyup and migrate requests, when deep copying an
object.

Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>
if copyup request was appended with non empty writes when the deep
copy was in flight.

Signed-off-by: Mykola Golub <mgolub@suse.com>
Signed-off-by: Mykola Golub <mgolub@suse.com>

trociny commented Jul 26, 2018

@dillaman thanks, updated


@dillaman dillaman left a comment


@trociny can you rebase so that I can merge?

dillaman pushed a commit that referenced this pull request Aug 14, 2018
librbd: support migrating images with minimal downtime

Reviewed-by: Jason Dillaman <dillaman@redhat.com>
@dillaman

Manually merged -- closing
