rbd-mirror: delay update snapshot mirror image state #39432

Merged: 3 commits, Feb 22, 2021

dillaman

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

The non-primary mirror snapshot is what is used to link the non-primary
to the primary image. If there is an interruption between creating the
non-primary image and the creation of the first non-primary snapshot,
the images will be considered unlinked.

A future commit will modify librbd to avoid setting the mirror image
state to enabled for non-primary snapshot-based mirroring images.
rbd-mirror will already automatically delete images in the CREATING
state during the bootstrap phase.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>

      derr << "incomplete local non-primary snapshot" << dendl;
      handle_replay_complete(locker, -EINVAL,
                             "incomplete local non-primary snapshot");
      return;

Contributor

@dillaman I am observing mirror snapshot stress test failures caused by a replayer failing on start with "incomplete local non-primary snapshot" [1].

I looked at one case [2]. The failure shows up after an image replayer is shut down on one instance and the image is acquired on another. At the time of the shutdown, the replayer is in the process of replaying a snapshot: it creates the destination snapshot, but before it starts copying it detects that a shutdown was requested, leaving an incomplete snapshot behind.

cluster1-client.mirror.1.44130.log.gz:

2021-02-15T10:54:22.388+0000 7fc72fa5f700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x559a87d6f000 shut_down: 
2021-02-15T10:54:22.388+0000 7fc72fa5f700 10 rbd::mirror::InstanceWatcher: 0x559a81d24a80 cancel_sync_request: sync_id=12c925a0ebfc
2021-02-15T10:54:22.388+0000 7fc72fa5f700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x559a87d6f000 shut_down: shut down pending on completion of snapshot replay
....
2021-02-15T10:54:23.490+0000 7fc729252700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x559a87d6f000 handle_create_non_primary_snapshot: r=0
2021-02-15T10:54:23.490+0000 7fc729252700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x559a87d6f000 handle_create_non_primary_snapshot: local_snap_id_end=239
2021-02-15T10:54:23.490+0000 7fc729252700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x559a87d6f000 request_sync: 
2021-02-15T10:54:23.490+0000 7fc729252700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x559a87d6f000 is_replay_interrupted: resuming pending shut down
2021-02-15T10:54:23.490+0000 7fc729252700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x559a87d6f000 unregister_remote_update_watcher: 
2021-02-15T10:54:23.490+0000 7fc729252700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x559a87d6f000 handle_unregister_remote_update_watcher: r=0
2021-02-15T10:54:23.490+0000 7fc729252700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x559a87d6f000 unregister_local_update_watcher: 
2021-02-15T10:54:23.490+0000 7fc729252700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x559a87d6f000 handle_unregister_local_update_watcher: r=0
2021-02-15T10:54:23.490+0000 7fc729252700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x559a87d6f000 wait_for_in_flight_ops: 
2021-02-15T10:54:23.490+0000 7fc72f25e700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x559a87d6f000 handle_wait_for_in_flight_ops: r=0
2021-02-15T10:54:23.490+0000 7fc72f25e700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x559a87d6f000 ~Replayer: 

Then, when another instance starts the replayer, it fails because the snapshot is incomplete:

cluster1-client.mirror.1.55775.log.gz:

2021-02-15T10:54:51.553+0000 7f389548f700 10 rbd::mirror::InstanceReplayer: 0x55baf4aec280 start_image_replayer: global_image_id=ab31b979-007c-49bc-8088-544ae5494ad1
2021-02-15T10:54:51.553+0000 7f389548f700 10 rbd::mirror::ImageReplayer: 0x55baf66c2280 [3/ab31b979-007c-49bc-8088-544ae5494ad1] start: on_finish=0x55baf667b1e0
...
2021-02-15T10:54:51.659+0000 7f388ec82700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x55baf6d2b400 scan_local_mirror_snapshots: local mirror snapshot: id=107, mirror_ns=[mirror state=non-primary, complete=1, mirror_peer_uuids=, primary_mirror_uuid=479e16be-4a87-4fd4-b053-c75a5ca44436, primary_snap_id=6c, last_copied_object_number=0, snap_seqs={108=18446744073709551614}]
2021-02-15T10:54:51.659+0000 7f388ec82700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x55baf6d2b400 scan_local_mirror_snapshots: local mirror snapshot: id=239, mirror_ns=[mirror state=non-primary, complete=0, mirror_peer_uuids=, primary_mirror_uuid=479e16be-4a87-4fd4-b053-c75a5ca44436, primary_snap_id=ef, last_copied_object_number=0, snap_seqs={238=238,239=18446744073709551614}]
2021-02-15T10:54:51.659+0000 7f388ec82700 -1 rbd::mirror::image_replayer::snapshot::Replayer: 0x55baf6d2b400 scan_local_mirror_snapshots: incomplete local non-primary snapshot
2021-02-15T10:54:51.659+0000 7f388ec82700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55baf6d2b400 notify_status_updated: 
2021-02-15T10:54:51.659+0000 7f389548f700 10 rbd::mirror::ImageReplayer: 0x55baf66c2280 [3/ab31b979-007c-49bc-8088-544ae5494ad1] handle_replayer_notification: 
2021-02-15T10:54:51.659+0000 7f389548f700 10 rbd::mirror::ImageReplayer: 0x55baf66c2280 [3/ab31b979-007c-49bc-8088-544ae5494ad1] handle_replayer_notification: replay interrupted: r=-22, error=incomplete local non-primary snapshot

[1] https://pulpito.ceph.com/trociny-2021-02-15_09:32:56-rbd-wip-mgolub-testing-distro-basic-smithi/
[2] http://qa-proxy.ceph.com/teuthology/trociny-2021-02-15_09:32:56-rbd-wip-mgolub-testing-distro-basic-smithi/5883923/teuthology.log

Author

Oops -- completed_non_primary_snapshots_exist was incorrectly being re-initialized to false within the loop.
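
To illustrate the kind of bug described here, the following is a minimal sketch, not the actual Replayer code. Only the flag name `completed_non_primary_snapshots_exist` comes from the comment above; `LocalSnap` and both functions are hypothetical. It shows how resetting the flag inside the loop makes the result depend only on the last snapshot scanned.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-in for a local mirror snapshot record.
struct LocalSnap {
  bool non_primary;
  bool complete;
};

// Buggy shape: the flag is reset on every iteration, so an earlier
// completed snapshot is forgotten once a later incomplete one is seen.
bool completed_exists_buggy(const std::vector<LocalSnap>& snaps) {
  bool completed_non_primary_snapshots_exist = false;
  for (const auto& snap : snaps) {
    completed_non_primary_snapshots_exist = false;  // bug: re-initialized
    if (snap.non_primary && snap.complete) {
      completed_non_primary_snapshots_exist = true;
    }
  }
  return completed_non_primary_snapshots_exist;
}

// Fixed shape: the flag is initialized once, before the loop.
bool completed_exists_fixed(const std::vector<LocalSnap>& snaps) {
  bool completed_non_primary_snapshots_exist = false;
  for (const auto& snap : snaps) {
    if (snap.non_primary && snap.complete) {
      completed_non_primary_snapshots_exist = true;
    }
  }
  return completed_non_primary_snapshots_exist;
}
```

With a complete snapshot followed by an incomplete one (the situation in the log above, ids 107 and 239), the buggy version reports that no completed snapshot exists, while the fixed version reports that one does.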

Jason Dillaman added 2 commits February 19, 2021 10:46
Tweak the normal pruning behavior to ensure that an incomplete initial
non-primary snapshot is not included in the prune set since we know
it will be complete since otherwise the image would have been deleted
due to not updating the mirror-image-state to enabled. Also ensure
we cannot prune a non-primary mirror snapshot if we don't have a
predecessor.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
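
One possible reading of the pruning rule in the commit message above, as a rough sketch rather than the actual rbd-mirror logic: an incomplete non-primary snapshot becomes a prune candidate only if an earlier completed non-primary snapshot exists. `NonPrimarySnap` and `incomplete_prune_candidates` are hypothetical names, and treating "predecessor" as "earlier completed snapshot" is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical, simplified non-primary mirror snapshot record.
struct NonPrimarySnap {
  uint64_t id;
  bool complete;
};

// Returns the ids of incomplete snapshots that may be pruned.
// Assumes `snaps` is ordered from oldest to newest.
std::set<uint64_t> incomplete_prune_candidates(
    const std::vector<NonPrimarySnap>& snaps) {
  std::set<uint64_t> prune;
  bool have_predecessor = false;  // assumption: an earlier completed snapshot
  for (const auto& snap : snaps) {
    if (snap.complete) {
      have_predecessor = true;
    } else if (have_predecessor) {
      // An incomplete snapshot with a predecessor can be re-created later.
      prune.insert(snap.id);
    }
    // An incomplete initial snapshot (no predecessor) is never added.
  }
  return prune;
}
```

Under this reading, a lone incomplete initial snapshot yields an empty prune set, while an incomplete snapshot that follows a completed one is a candidate.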
The creating state is a special case in rbd-mirror where it will
automatically delete the image since it assumes it's malformed.
A non-primary, snapshot-based mirror image needs to have at least
one non-primary snapshot and the first one is not created until
after replay has started. Now rbd-mirror will update the mirror
image state to the enabled state after creating the first
non-primary snapshot but before attempting the sync.

Fixes: https://tracker.ceph.com/issues/49238
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
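
The ordering described in this commit message can be modeled with a small sketch. All names here (`MirrorImageState`, `LocalImage`, `Step`, `replay_first_snapshot`, `bootstrap_would_delete`) are hypothetical simplifications, not librbd or rbd-mirror APIs. The point is only that the state flips to enabled after the first non-primary snapshot is created and before the sync, so an interruption before that point leaves the image eligible for automatic deletion.

```cpp
#include <cassert>

// Hypothetical reduction of the mirror image state to the two values
// that matter for this commit.
enum class MirrorImageState { Creating, Enabled };

struct LocalImage {
  MirrorImageState state = MirrorImageState::Creating;
  int non_primary_snapshots = 0;
  bool first_sync_done = false;
};

// Steps of the first snapshot replay, in the order the commit message gives.
enum class Step { CreateSnapshot, UpdateState, Sync };

// Runs the first replay and stops after `last` to model an interruption.
void replay_first_snapshot(LocalImage& image, Step last) {
  image.non_primary_snapshots = 1;          // create first non-primary snapshot
  if (last == Step::CreateSnapshot) return;
  image.state = MirrorImageState::Enabled;  // update state before the sync
  if (last == Step::UpdateState) return;
  image.first_sync_done = true;             // sync image data
}

// Bootstrap treats an image still in the creating state as malformed.
bool bootstrap_would_delete(const LocalImage& image) {
  return image.state == MirrorImageState::Creating;
}
```

In this model, an image interrupted right after snapshot creation is still deleted on the next bootstrap, whereas one interrupted after the state update survives with its first snapshot in place and the sync still pending.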
Contributor

@trociny left a comment

LGTM

@trociny merged commit d8b02ae into ceph:master on Feb 22, 2021