Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pacific: mirror snapshot schedule and trash purge schedule fixes #46778

Merged
merged 6 commits into from Jul 5, 2022

Conversation

idryomov
Copy link
Contributor

…e logs

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit bd4af82)

Conflicts:
	src/pybind/mgr/rbd_support/mirror_snapshot_schedule.py [ commit
	  e4a16e2 ("mgr/rbd_support: add type annotation") not in
	  pacific ]
	src/pybind/mgr/rbd_support/trash_purge_schedule.py [ ditto ]
…eduling

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 568345b)

Conflicts:
	src/pybind/mgr/rbd_support/mirror_snapshot_schedule.py [ commit
	  e4a16e2 ("mgr/rbd_support: add type annotation") not in
	  pacific ]
	src/pybind/mgr/rbd_support/trash_purge_schedule.py [ ditto ]
Make refresh_pools() behave the same as refresh_images().

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 7d1e644)

Conflicts:
	src/pybind/mgr/rbd_support/trash_purge_schedule.py [ commit
	  e4a16e2 ("mgr/rbd_support: add type annotation") not in
	  pacific ]
The existing logic often leads to refresh_pools() and refresh_images()
being invoked after a 120 second delay instead of after an intended 60
second delay.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit ef3edd3)

Conflicts:
	src/pybind/mgr/rbd_support/mirror_snapshot_schedule.py [ commit
	  e4a16e2 ("mgr/rbd_support: add type annotation") not in
	  pacific ]
	src/pybind/mgr/rbd_support/trash_purge_schedule.py [ ditto ]
If load_schedules() (i.e. periodic refresh) races with add_schedule()
invoked by the user for a fresh image, that image's schedule may get
lost until the next rebuild (not refresh!) of the queue:

1. periodic refresh invokes load_schedules()
2. load_schedules() creates a new Schedules instance and loads
   schedules from rbd_mirror_snapshot_schedule object
3. add_schedule() is invoked for a new image (an image that isn't
   present in self.images) by the user
4. before load_schedules() can grab self.lock, add_schedule() commits
   the new schedule to rbd_mirror_snapshot_schedule object and adds it
   to self.schedules
5. load_schedules() grabs self.lock and reassigns self.schedules with
   Schedules instance that is now stale
6. periodic refresh invokes load_pool_images() which discovers the new
   image; eventually it is added to self.images
7. periodic refresh invokes refresh_queue() which attempts to enqueue()
   the new image; this fails because a matching schedule isn't present

The next periodic refresh recovers the discarded schedule from
rbd_mirror_snapshot_schedule object but no attempt to enqueue() that
image is made since it is already "known" at that point.  Despite the
schedule being in place, no snapshots are created until the queue is
rebuilt from scratch or rbd_support module is reloaded.

To fix that, extend self.lock critical sections so that add_schedule()
and remove_schedule() can't get stepped on by load_schedules().

Fixes: https://tracker.ceph.com/issues/56090
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 95a0ec7)

Conflicts:
	src/pybind/mgr/rbd_support/mirror_snapshot_schedule.py [ commit
	  e4a16e2 ("mgr/rbd_support: add type annotation") not in
	  pacific ]
	src/pybind/mgr/rbd_support/trash_purge_schedule.py [ ditto ]
Establishing a watch on rbd_mirroring object and skipping rescanning
image mirror snapshots on periodic refresh unless rbd_mirroring object
gets notified in the interim is flawed.  rbd_mirroring object is
notified when mirroring is enabled or disabled on some image (including
when the image is removed), but it is not notified when images are
promoted or demoted.  However, load_pool_images() discards images that
are not primary at the time of the scan.  If the image is promoted
later, no snapshots are created even if the schedule is in place.  This
happens regardless of whether the schedule is added before or after the
promotion.

This effectively reverts commit 69259c8 ("mgr/rbd_support: make
mirror_snapshot_schedule rescan only updated pools").  An alternative
fix could be to stop discarding non-primary images (i.e. drop

    if not info['primary']:
        continue

check added in commit d39eb28 ("mgr/rbd_support: mirror snapshot
schedule should skip non-primary images")), but that would clutter the
queue and therefore "rbd mirror snapshot schedule status" output with
bogus entries.  Performing a rescan roughly every 60 seconds should be
manageable: currently it amounts to a single mirror_image_status_list
request, followed by mirror_image_get, get_snapcontext and snapshot_get
requests for each snapshot-based mirroring enabled image and concluded
by a single dir_list request.  Among these, per-image get_snapcontext
and snapshot_get requests are necessary for determining primaryness.

Fixes: https://tracker.ceph.com/issues/53914
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
(cherry picked from commit 7fb4fdb)

Conflicts:
	src/pybind/mgr/rbd_support/mirror_snapshot_schedule.py [ commit
	  e4a16e2 ("mgr/rbd_support: add type annotation") not in
	  pacific ]
@yuriw yuriw merged commit 853c888 into ceph:pacific Jul 5, 2022
@idryomov idryomov deleted the wip-rbd-schedule-backports-pacific branch July 5, 2022 16:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants