New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
octopus: rbd-mirror: fix mirror image removal #43663
Conversation
jenkins test make check |
jenkins test api |
jenkins test make check |
1 similar comment
jenkins test make check |
jenkins test api |
jenkins test make check |
jenkins test make check |
1 similar comment
jenkins test make check |
The make check finally passed (was failing because of some python dependencies I think)! This needs the need-qa label AFAIU though. |
@MrFreezeex failures seems related can you take a look: |
Hmmm there are rbd-mirror failure related to killing the daemon, supposedly because it has already crashed. We probably need the rbd-mirror daemon logs to know why though. There are also some rbd failure not related to my change here, maybe that would be the reason? |
@MrFreezeex You may access the logs via http. E.g. here is the entry point for 6538462 job: Also, if you don't have access to the sepia lab yet (to run teuthology tests and review logs via ssh) I suggest to open a ticket at https://tracker.ceph.com/projects/lab/issues?set_filter=1&tracker_id=3 And I looked at one of failures. The rbd-mirror processes indeed crashed (there are core files), but strangely I have not found the corresponding back traces in the rbd-mirror logs. So we might need to use gdb to debug the core to see what was going on there. I did not do this because it one also needs corresponding binaries and libs... I will probably do this later if nobody will do this before me. |
Also, @MrFreezeex did you try to reproduce it locally? Because it would be the easiest way to track if it is reproduced locally. |
I will do that, thanks.
Not yet, I will try. |
Locally the snapshot workunit does succeeds, I have requested the access to the sepia lab, let's see... |
@MrFreezeex did you get a chance to get to this PR? |
Pls add |
Hey @ideepika, sorry for the late reply, I just came back from holydays. I checked and indeed like @trociny said there are coredump but no trace of crashes in rbd-mirror logs. I think the build is too old and somehow pruned though (https://shaman.ceph.com/api/search/?status=ready&project=ceph&flavor=default&distros=centos%2F8%2Fx86_64&sha1=6390aa2b1e59f0681f52f65f10f8a4533a17c670) which means that I can't debug those. It would probably be useful to restart the teuthology test on this. I have access to sepia now but I am not in ceph org yet, so I can't do it right now but hopefully soon. |
sure, maybe I can trigger a build for you meanwhile |
jenkins test make check |
jenkins test api |
@MrFreezeex I think the change in your additional commit "rbd-mirror: fix TrashMoveRequest segfault" is correct, but I would add this change to the cherry-picked "rbd-mirror: remove mirror image at shut_down when there is no images" instead, where actually the issue was introduced, and add a "Conflicts" note there explaining this change, i.e. why we needed here and did not in the master (because in the master we do not explicitly delete the image context any more). |
Invoke ImageRemoveRequest instead of calling directly mirror_image_remove so that the MirrroringWatcher can pick up local image deletion. Fixes: https://tracker.ceph.com/issues/51031 Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch> (cherry picked from commit 34082b7) Conflicts: src/test/rbd_mirror/image_deleter/test_mock_TrashMoveRequest.cc - Trivial conflict resolution
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch> (cherry picked from commit 0e147f7) Conflicts: src/cls/rbd/cls_rbd_client.h - Trivial conflict resolution
In a scenario where you have rbd-mirror daemons on both clusters. The rbd-mirror daemon on the primary site will not properly cleanup his status on image removal. This commit add a path for direct removal at the shut_down of the ImageReplayer to properly cleanup the metadata. Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch> (cherry picked from commit a538c5d) Conflicts: src/test/rbd_mirror/test_mock_MirrorStatusUpdater.cc - Trivial conflict resolution; io_ctx exec has 1 less argument
Some cases makes the ImageReplayer to be eternally restarted if there is no local and remote images. If both images are absent and that the local image id exists, the ImageReplayer shutdown will request a mirror image removal. Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch> (cherry picked from commit 0c1c7fb) Conflicts: src/tools/rbd_mirror/image_deleter/TrashMoveRequest.cc Added a condition to handle the case where m_image_ctx is null on close_image and handle_close_image in the TrashMoveRequest. This fix is not needed in newer versions of Ceph as ImageCtx no longer needs to be destroyed explicitely with a destroy method after Octopus.
This make sure that all images are deleted in the existing qa scripts and checks if all rbd-mirror metadata in OMAP are correctly deleted. Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch> (cherry picked from commit 4db66da)
In the LoadRequest in the ImageMap class add initial cleanup to remove stale entries. To cleanup the LoadRequest will query the mirror image list and remove all the image_map that are notin the list. Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch> (cherry picked from commit e135403)
This prevent image_status_set to succeed when there is no mirror image yet. This solves some stale entries that were not removed in rbd-mirror and prevent to add entries that would not be visible from the rbd cli. Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch> (cherry picked from commit 416e257)
In some split-brain scenario the image is removed while the image_mapped is false. This prevents the removal of image_map in OMAP and thus the entry will not be removed until the daemon is restarted. Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch> (cherry picked from commit 35398a5)
If the image is being removed the PrepareRemoteImageRequest was returning the same error if the image was disabled or non primary which doesn't allow the BootstrapRequest to have the correct error handling. This commit fix this behavior by considering that the remote image is already deleted if the image is in disabling state. Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch> (cherry picked from commit ff60aec)
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch> (cherry picked from commit 965bc41) Conflicts: src/tools/rbd_mirror/image_replayer/PrepareLocalImageRequest.cc - Trivial conflict resolution; s/lirbd::asio:://
In some cases, set_state is called with DISSOCIATING, then ASSOCIATING and DISSOCIATING again. In this case the state DISSOCIATING is processed to remove the image and then schedule the next action which is associating. To fix this case, this commit removes the next_state if the state is sets to the same state. Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch> (cherry picked from commit b664a95)
Try fixing sporadic failure linked in the tracker in TestMockMirrorStatusUpdater.RemoveImmediateUpdate by making it synchronous. Fixes: https://tracker.ceph.com/issues/53375 Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch> (cherry picked from commit 9385acf)
8ab582f
to
8891af5
Compare
Done! I also rebased the PR, hopefully it will help with the make check failure in jenkins. For the record here is the teuthology run before applying the fix https://pulpito.ceph.com/mrfreezeex-2022-01-07_21:07:03-rbd:mirror-wip-53031-MrFreezeex-octopus-distro-default-smithi/ and after the fix http://pulpito.front.sepia.ceph.com/mrfreezeex-2022-01-11_16:57:26-rbd:mirror-wip-53031-MrFreezeex-octopus-distro-default-smithi/ so it should work properly with octopus now! |
backport tracker: https://tracker.ceph.com/issues/53031
backport of #41696
parent tracker: https://tracker.ceph.com/issues/51031
backport tracker: https://tracker.ceph.com/issues/53387
backport of #44064
parent tracker: https://tracker.ceph.com/issues/53375
this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/master/src/script/ceph-backport.sh