rbd-mirror: delay update snapshot mirror image state #39432
The non-primary mirror snapshot is what links the non-primary image to the primary image. If there is an interruption between creating the non-primary image and creating the first non-primary snapshot, the images will be considered unlinked. A future commit will modify librbd to avoid setting the mirror image state to enabled for non-primary snapshot-based mirroring images. rbd-mirror will already automatically delete images in the CREATING state during the bootstrap phase.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
derr << "incomplete local non-primary snapshot" << dendl;
handle_replay_complete(locker, -EINVAL,
                       "incomplete local non-primary snapshot");
return;
@dillaman I am observing mirror snapshot stress test failures caused by a replayer failing on start with "incomplete local non-primary snapshot" [1].
I looked at one case [2]. It is observed after an image is resumed on one instance and acquired on another. While it is being resumed, the replayer is in the process of replaying a snapshot: it creates the destination snapshot but, before it starts copying, detects that a shutdown has been requested, leaving an incomplete snapshot behind.
cluster1-client.mirror.1.44130.log.gz:
2021-02-15T10:54:22.388+0000 7fc72fa5f700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x559a87d6f000 shut_down:
2021-02-15T10:54:22.388+0000 7fc72fa5f700 10 rbd::mirror::InstanceWatcher: 0x559a81d24a80 cancel_sync_request: sync_id=12c925a0ebfc
2021-02-15T10:54:22.388+0000 7fc72fa5f700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x559a87d6f000 shut_down: shut down pending on completion of snapshot replay
....
2021-02-15T10:54:23.490+0000 7fc729252700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x559a87d6f000 handle_create_non_primary_snapshot: r=0
2021-02-15T10:54:23.490+0000 7fc729252700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x559a87d6f000 handle_create_non_primary_snapshot: local_snap_id_end=239
2021-02-15T10:54:23.490+0000 7fc729252700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x559a87d6f000 request_sync:
2021-02-15T10:54:23.490+0000 7fc729252700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x559a87d6f000 is_replay_interrupted: resuming pending shut down
2021-02-15T10:54:23.490+0000 7fc729252700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x559a87d6f000 unregister_remote_update_watcher:
2021-02-15T10:54:23.490+0000 7fc729252700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x559a87d6f000 handle_unregister_remote_update_watcher: r=0
2021-02-15T10:54:23.490+0000 7fc729252700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x559a87d6f000 unregister_local_update_watcher:
2021-02-15T10:54:23.490+0000 7fc729252700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x559a87d6f000 handle_unregister_local_update_watcher: r=0
2021-02-15T10:54:23.490+0000 7fc729252700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x559a87d6f000 wait_for_in_flight_ops:
2021-02-15T10:54:23.490+0000 7fc72f25e700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x559a87d6f000 handle_wait_for_in_flight_ops: r=0
2021-02-15T10:54:23.490+0000 7fc72f25e700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x559a87d6f000 ~Replayer:
Then, when another instance starts the replayer, it fails because the snapshot is incomplete:
cluster1-client.mirror.1.55775.log.gz:
2021-02-15T10:54:51.553+0000 7f389548f700 10 rbd::mirror::InstanceReplayer: 0x55baf4aec280 start_image_replayer: global_image_id=ab31b979-007c-49bc-8088-544ae5494ad1
2021-02-15T10:54:51.553+0000 7f389548f700 10 rbd::mirror::ImageReplayer: 0x55baf66c2280 [3/ab31b979-007c-49bc-8088-544ae5494ad1] start: on_finish=0x55baf667b1e0
...
2021-02-15T10:54:51.659+0000 7f388ec82700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x55baf6d2b400 scan_local_mirror_snapshots: local mirror snapshot: id=107, mirror_ns=[mirror state=non-primary, complete=1, mirror_peer_uuids=, primary_mirror_uuid=479e16be-4a87-4fd4-b053-c75a5ca44436, primary_snap_id=6c, last_copied_object_number=0, snap_seqs={108=18446744073709551614}]
2021-02-15T10:54:51.659+0000 7f388ec82700 15 rbd::mirror::image_replayer::snapshot::Replayer: 0x55baf6d2b400 scan_local_mirror_snapshots: local mirror snapshot: id=239, mirror_ns=[mirror state=non-primary, complete=0, mirror_peer_uuids=, primary_mirror_uuid=479e16be-4a87-4fd4-b053-c75a5ca44436, primary_snap_id=ef, last_copied_object_number=0, snap_seqs={238=238,239=18446744073709551614}]
2021-02-15T10:54:51.659+0000 7f388ec82700 -1 rbd::mirror::image_replayer::snapshot::Replayer: 0x55baf6d2b400 scan_local_mirror_snapshots: incomplete local non-primary snapshot
2021-02-15T10:54:51.659+0000 7f388ec82700 10 rbd::mirror::image_replayer::snapshot::Replayer: 0x55baf6d2b400 notify_status_updated:
2021-02-15T10:54:51.659+0000 7f389548f700 10 rbd::mirror::ImageReplayer: 0x55baf66c2280 [3/ab31b979-007c-49bc-8088-544ae5494ad1] handle_replayer_notification:
2021-02-15T10:54:51.659+0000 7f389548f700 10 rbd::mirror::ImageReplayer: 0x55baf66c2280 [3/ab31b979-007c-49bc-8088-544ae5494ad1] handle_replayer_notification: replay interrupted: r=-22, error=incomplete local non-primary snapshot
[1] https://pulpito.ceph.com/trociny-2021-02-15_09:32:56-rbd-wip-mgolub-testing-distro-basic-smithi/
[2] http://qa-proxy.ceph.com/teuthology/trociny-2021-02-15_09:32:56-rbd-wip-mgolub-testing-distro-basic-smithi/5883923/teuthology.log
Oops -- completed_non_primary_snapshots_exist was incorrectly being re-initialized to false within the loop.
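For readers following along, here is a minimal sketch of why the placement of that flag matters. It is not the actual rbd-mirror code: `SnapInfo`, `ScanResult`, and `scan_local_snapshots` are hypothetical names, and the rule it encodes (an incomplete non-primary snapshot is only an error when no completed one precedes it) is an assumed simplification of the scan. Only the flag name comes from the comment above.

```cpp
#include <vector>

// Hypothetical, simplified view of a local mirror snapshot.
struct SnapInfo {
  bool non_primary;
  bool complete;
};

enum class ScanResult { Ok, IncompleteInitialSnapshot };

// Assumed simplification of the local snapshot scan, oldest snapshot first.
ScanResult scan_local_snapshots(const std::vector<SnapInfo>& snaps) {
  // Declared once, before the loop, so the value survives across iterations.
  // Re-initializing it to false inside the loop body would forget every
  // completed snapshot seen so far.
  bool completed_non_primary_snapshots_exist = false;

  for (const auto& snap : snaps) {
    if (!snap.non_primary) {
      continue;
    }
    if (snap.complete) {
      completed_non_primary_snapshots_exist = true;
      continue;
    }
    if (!completed_non_primary_snapshots_exist) {
      // No completed predecessor: treated as an error in this sketch.
      return ScanResult::IncompleteInitialSnapshot;
    }
    // Incomplete, but a completed predecessor exists: recoverable here.
  }
  return ScanResult::Ok;
}
```

With the snapshot pair from the failing log (id=107 complete, id=239 incomplete), this sketch returns `ScanResult::Ok`; if the flag were reset on every iteration, the same input would be reported as an incomplete initial snapshot.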
Tweak the normal pruning behavior to ensure that an incomplete initial non-primary snapshot is not included in the prune set. We know it will be complete, because otherwise the image would have been deleted for not having its mirror image state updated to enabled. Also ensure we cannot prune a non-primary mirror snapshot if we don't have a predecessor.
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
The creating state is a special case in rbd-mirror: it will automatically delete an image in that state, since it assumes the image is malformed. A non-primary, snapshot-based mirror image needs to have at least one non-primary snapshot, and the first one is not created until after replay has started. Now rbd-mirror will update the mirror image state to the enabled state after creating the first non-primary snapshot but before attempting the sync.
Fixes: https://tracker.ceph.com/issues/49238
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
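The ordering described in this commit message can be illustrated with a small model. This is a sketch under stated assumptions, not the rbd-mirror implementation: `LocalImage`, `bootstrap_should_delete`, `replay_first_snapshot`, and the step names in the log are all hypothetical, chosen only to show the sequence of create snapshot, update state, then sync.

```cpp
#include <string>
#include <vector>

enum class MirrorImageState { Creating, Enabled };

// Bootstrap rule as described above: images still in the creating state
// are assumed to be malformed and are deleted.
bool bootstrap_should_delete(MirrorImageState state) {
  return state == MirrorImageState::Creating;
}

// Hypothetical model of a local non-primary image.
struct LocalImage {
  MirrorImageState state = MirrorImageState::Creating;
  int non_primary_snapshots = 0;
  std::vector<std::string> steps;  // record of the steps taken, in order
};

// Models replay of the first snapshot. interrupt_before_sync stands in for
// a shut down requested between snapshot creation and the sync.
void replay_first_snapshot(LocalImage& image, bool interrupt_before_sync) {
  ++image.non_primary_snapshots;
  image.steps.push_back("create_non_primary_snapshot");

  // The state flips to enabled only once a non-primary snapshot exists,
  // and before any data is copied.
  if (image.state == MirrorImageState::Creating) {
    image.state = MirrorImageState::Enabled;
    image.steps.push_back("update_mirror_image_state");
  }

  if (interrupt_before_sync) {
    return;  // the image is already linked, so bootstrap will not delete it
  }
  image.steps.push_back("sync");
}
```

In this model an image interrupted before its first snapshot is still in the creating state and would be deleted at the next bootstrap, while an image interrupted after the snapshot but before the sync is already enabled and survives.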
LGTM