New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pacific: mirror snapshot schedule and trash purge schedule fixes #46778
Merged
yuriw
merged 6 commits into
ceph:pacific
from
idryomov:wip-rbd-schedule-backports-pacific
Jul 5, 2022
Merged
pacific: mirror snapshot schedule and trash purge schedule fixes #46778
yuriw
merged 6 commits into
ceph:pacific
from
idryomov:wip-rbd-schedule-backports-pacific
Jul 5, 2022
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
The existing logic often leads to refresh_pools() and refresh_images() being invoked after a 120 second delay instead of after an intended 60 second delay. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> (cherry picked from commit ef3edd3) Conflicts: src/pybind/mgr/rbd_support/mirror_snapshot_schedule.py [ commit e4a16e2 ("mgr/rbd_support: add type annotation") not in pacific ] src/pybind/mgr/rbd_support/trash_purge_schedule.py [ ditto ]
If load_schedules() (i.e. periodic refresh) races with add_schedule() invoked by the user for a fresh image, that image's schedule may get lost until the next rebuild (not refresh!) of the queue: 1. periodic refresh invokes load_schedules() 2. load_schedules() creates a new Schedules instance and loads schedules from rbd_mirror_snapshot_schedule object 3. add_schedule() is invoked for a new image (an image that isn't present in self.images) by the user 4. before load_schedules() can grab self.lock, add_schedule() commits the new schedule to rbd_mirror_snapshot_schedule object and adds it to self.schedules 5. load_schedules() grabs self.lock and reassigns self.schedules with Schedules instance that is now stale 6. periodic refresh invokes load_pool_images() which discovers the new image; eventually it is added to self.images 7. periodic refresh invokes refresh_queue() which attempts to enqueue() the new image; this fails because a matching schedule isn't present The next periodic refresh recovers the discarded schedule from rbd_mirror_snapshot_schedule object but no attempt to enqueue() that image is made since it is already "known" at that point. Despite the schedule being in place, no snapshots are created until the queue is rebuilt from scratch or rbd_support module is reloaded. To fix that, extend self.lock critical sections so that add_schedule() and remove_schedule() can't get stepped on by load_schedules(). Fixes: https://tracker.ceph.com/issues/56090 Signed-off-by: Ilya Dryomov <idryomov@gmail.com> (cherry picked from commit 95a0ec7) Conflicts: src/pybind/mgr/rbd_support/mirror_snapshot_schedule.py [ commit e4a16e2 ("mgr/rbd_support: add type annotation") not in pacific ] src/pybind/mgr/rbd_support/trash_purge_schedule.py [ ditto ]
Establishing a watch on rbd_mirroring object and skipping rescanning image mirror snapshots on periodic refresh unless rbd_mirroring object gets notified in the interim is flawed. rbd_mirroring object is notified when mirroring is enabled or disabled on some image (including when the image is removed), but it is not notified when images are promoted or demoted. However, load_pool_images() discards images that are not primary at the time of the scan. If the image is promoted later, no snapshots are created even if the schedule is in place. This happens regardless of whether the schedule is added before or after the promotion. This effectively reverts commit 69259c8 ("mgr/rbd_support: make mirror_snapshot_schedule rescan only updated pools"). An alternative fix could be to stop discarding non-primary images (i.e. drop if not info['primary']: continue check added in commit d39eb28 ("mgr/rbd_support: mirror snapshot schedule should skip non-primary images")), but that would clutter the queue and therefore "rbd mirror snapshot schedule status" output with bogus entries. Performing a rescan roughly every 60 seconds should be manageable: currently it amounts to a single mirror_image_status_list request, followed by mirror_image_get, get_snapcontext and snapshot_get requests for each snapshot-based mirroring enabled image and concluded by a single dir_list request. Among these, per-image get_snapcontext and snapshot_get requests are necessary for determining primaryness. Fixes: https://tracker.ceph.com/issues/53914 Signed-off-by: Ilya Dryomov <idryomov@gmail.com> (cherry picked from commit 7fb4fdb) Conflicts: src/pybind/mgr/rbd_support/mirror_snapshot_schedule.py [ commit e4a16e2 ("mgr/rbd_support: add type annotation") not in pacific ]
ideepika
approved these changes
Jun 21, 2022
ideepika
approved these changes
Jun 21, 2022
sunnyku
approved these changes
Jun 21, 2022
trociny
approved these changes
Jun 22, 2022
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
backport trackers:
https://tracker.ceph.com/issues/56143
https://tracker.ceph.com/issues/56144
backport of #46734 and #46743
parent trackers:
https://tracker.ceph.com/issues/56090
https://tracker.ceph.com/issues/53914