
octopus: rbd-mirror: fix mirror image removal #43663

Merged

merged 12 commits into ceph:octopus on Jan 20, 2022

Conversation

@MrFreezeex (Member) commented Oct 26, 2021

backport tracker: https://tracker.ceph.com/issues/53031
backport of #41696
parent tracker: https://tracker.ceph.com/issues/51031


backport tracker: https://tracker.ceph.com/issues/53387
backport of #44064
parent tracker: https://tracker.ceph.com/issues/53375


this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/master/src/script/ceph-backport.sh

github-actions bot added this to the octopus milestone on Oct 26, 2021
@MrFreezeex (Member Author)

jenkins test make check

@MrFreezeex (Member Author)

jenkins test api

@MrFreezeex (Member Author)

jenkins test make check

@MrFreezeex (Member Author)

jenkins test make check

@MrFreezeex (Member Author)

jenkins test api

@MrFreezeex (Member Author)

jenkins test make check

@MrFreezeex (Member Author)

jenkins test make check

@MrFreezeex (Member Author)

jenkins test make check

@MrFreezeex (Member Author) commented Nov 2, 2021

The make check finally passed (it was failing because of some Python dependencies, I think)! This needs the needs-qa label AFAIU, though.

@MrFreezeex (Member Author)

@trociny Do you think I should add #44064 here?

@trociny (Contributor) commented Nov 24, 2021

> @trociny Do you think I should add #44064 here?

Yes, it will work perfectly. Just update the description to include "backport of"/"tracker ..." links.

@ideepika (Member) commented Dec 9, 2021

@MrFreezeex (Member Author)

Hmmm, there are rbd-mirror failures related to killing the daemon, supposedly because it has already crashed. We probably need the rbd-mirror daemon logs to know why, though. There are also some rbd failures not related to my change here; maybe that would be the reason?

@trociny (Contributor) commented Dec 9, 2021

@MrFreezeex You may access the logs via HTTP. E.g. here is the entry point for the 6538462 job:

http://qa-proxy.ceph.com/teuthology/yuriw-2021-12-01_17:21:38-rbd-wip-yuri7-testing-2021-11-30-0847-octopus-distro-default-smithi/6538462/

Also, if you don't have access to the sepia lab yet (to run teuthology tests and review logs via ssh), I suggest opening a ticket at https://tracker.ceph.com/projects/lab/issues?set_filter=1&tracker_id=3

Also, I looked at one of the failures. The rbd-mirror processes indeed crashed (there are core files), but strangely I have not found the corresponding backtraces in the rbd-mirror logs. So we might need to use gdb on the core to see what was going on there. I did not do this yet because one also needs the corresponding binaries and libs... I will probably do it later if nobody does it before me.

@trociny (Contributor) commented Dec 9, 2021

Also, @MrFreezeex did you try to reproduce it locally? That would be the easiest way to track it down, if it reproduces locally.

@MrFreezeex (Member Author)

> Also, if you don't have access to the sepia lab yet (to run teuthology tests and review logs via ssh), I suggest opening a ticket at https://tracker.ceph.com/projects/lab/issues?set_filter=1&tracker_id=3

I will do that, thanks.

> Also, @MrFreezeex did you try to reproduce it locally? That would be the easiest way to track it down, if it reproduces locally.

Not yet, I will try.

@MrFreezeex (Member Author)

Locally the snapshot workunit does succeed. I have requested access to the sepia lab, let's see...

@ideepika (Member) commented Jan 4, 2022

> Locally the snapshot workunit does succeed. I have requested access to the sepia lab, let's see...

@MrFreezeex did you get a chance to get to this PR?

@yuriw (Contributor) commented Jan 4, 2022

Please add needs-qa when all is fixed. Ref: https://trello.com/c/41G4eFjp

@MrFreezeex (Member Author)

> @MrFreezeex did you get a chance to get to this PR?

Hey @ideepika, sorry for the late reply, I just came back from holidays.

I checked and indeed, like @trociny said, there are coredumps but no trace of crashes in the rbd-mirror logs. I think the build is too old and was somehow pruned, though (https://shaman.ceph.com/api/search/?status=ready&project=ceph&flavor=default&distros=centos%2F8%2Fx86_64&sha1=6390aa2b1e59f0681f52f65f10f8a4533a17c670), which means that I can't debug those.

It would probably be useful to restart the teuthology test on this. I have access to sepia now, but I am not in the ceph org yet, so I can't do it right now but hopefully soon.

@ideepika (Member) commented Jan 6, 2022

> > @MrFreezeex did you get a chance to get to this PR?
>
> Hey @ideepika, sorry for the late reply, I just came back from holidays.
>
> I checked and indeed, like @trociny said, there are coredumps but no trace of crashes in the rbd-mirror logs. I think the build is too old and was somehow pruned, though (https://shaman.ceph.com/api/search/?status=ready&project=ceph&flavor=default&distros=centos%2F8%2Fx86_64&sha1=6390aa2b1e59f0681f52f65f10f8a4533a17c670), which means that I can't debug those.
>
> It would probably be useful to restart the teuthology test on this. I have access to sepia now, but I am not in the ceph org yet, so I can't do it right now but hopefully soon.

sure, maybe I can trigger a build for you meanwhile

@ideepika (Member) commented Jan 6, 2022

https://shaman.ceph.com/builds/ceph/wip-53031-MrFreezeex-octopus/

@MrFreezeex (Member Author)

jenkins test make check

@MrFreezeex (Member Author)

jenkins test api

@trociny (Contributor) commented Jan 12, 2022

@MrFreezeex I think the change in your additional commit "rbd-mirror: fix TrashMoveRequest segfault" is correct, but I would add this change to the cherry-picked "rbd-mirror: remove mirror image at shut_down when there is no images" instead, where the issue was actually introduced, and add a "Conflicts" note there explaining this change, i.e. why we need it here and did not in master (because in master we do not explicitly delete the image context any more).

Invoke ImageRemoveRequest instead of calling mirror_image_remove
directly so that the MirroringWatcher can pick up local image
deletion.

Fixes: https://tracker.ceph.com/issues/51031
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
(cherry picked from commit 34082b7)

 Conflicts:
	src/test/rbd_mirror/image_deleter/test_mock_TrashMoveRequest.cc
- Trivial conflict resolution
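
To make this commit's description concrete, here is a minimal standalone sketch of the pattern it describes; all names below are hypothetical stand-ins, not the actual librbd API.

```cpp
#include <iostream>
#include <string>

// Hypothetical stand-ins for the librbd pieces involved; this only
// illustrates why routing removal through a request type matters.
void mirror_image_remove(const std::string &image_id) {
  std::cout << "omap: mirror record removed for " << image_id << "\n";
}

struct MirroringWatcher {
  static void notify_image_updated(const std::string &image_id) {
    std::cout << "watcher: notified about deletion of " << image_id << "\n";
  }
};

// ImageRemoveRequest-style path: removes the record and also notifies
// watchers, so rbd-mirror daemons observe the local image deletion.
void image_remove_request(const std::string &image_id) {
  mirror_image_remove(image_id);
  MirroringWatcher::notify_image_updated(image_id);
}

int main() {
  mirror_image_remove("img-old");   // old path: watchers never hear about it
  image_remove_request("img-new");  // new path: watchers are informed
}
```
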
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
(cherry picked from commit 0e147f7)

 Conflicts:
	src/cls/rbd/cls_rbd_client.h
- Trivial conflict resolution

In a scenario where you have rbd-mirror daemons on both clusters, the
rbd-mirror daemon on the primary site will not properly clean up its
status on image removal.

This commit adds a path for direct removal at the shut_down of the
ImageReplayer to properly clean up the metadata.

Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
(cherry picked from commit a538c5d)

Conflicts:
        src/test/rbd_mirror/test_mock_MirrorStatusUpdater.cc
- Trivial conflict resolution; io_ctx exec has 1 less argument
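
A minimal standalone sketch of the direct-removal path this commit message describes, with hypothetical names; the real ImageReplayer does all of this asynchronously through librados.

```cpp
#include <iostream>
#include <string>

struct ImageReplayer {
  std::string m_local_image_id;
  bool m_local_image_deleted = false;  // observed during replay

  void shut_down() {
    if (m_local_image_deleted && !m_local_image_id.empty()) {
      // New path: clean up the mirror metadata for the removed image
      // instead of leaving a stale status entry on the primary site.
      remove_local_mirror_image();
      return;
    }
    finish_shut_down();
  }

  void remove_local_mirror_image() {
    std::cout << "removing mirror metadata for " << m_local_image_id << "\n";
    finish_shut_down();
  }

  void finish_shut_down() { std::cout << "replayer shut down\n"; }
};

int main() {
  ImageReplayer replayer;
  replayer.m_local_image_id = "img-1";
  replayer.m_local_image_deleted = true;
  replayer.shut_down();
}
```
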
Some cases make the ImageReplayer restart eternally if there are no
local and remote images.

If both images are absent and the local image id exists, the
ImageReplayer shutdown will request a mirror image removal.

Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
(cherry picked from commit 0c1c7fb)

 Conflicts:
	src/tools/rbd_mirror/image_deleter/TrashMoveRequest.cc

Added a condition to handle the case where m_image_ctx is null in
close_image and handle_close_image in the TrashMoveRequest. This fix is
not needed in newer versions of Ceph, as ImageCtx no longer needs to be
destroyed explicitly with a destroy method after Octopus.
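
A standalone sketch of the null guard described in this conflict note, assuming a simplified synchronous flow; in the octopus-era code the close is asynchronous and ImageCtx must be destroyed explicitly.

```cpp
#include <iostream>

// Hypothetical simplification of the octopus-era types.
struct ImageCtx {
  void destroy() { delete this; }
};

struct TrashMoveRequest {
  ImageCtx *m_image_ctx = nullptr;

  void close_image() {
    if (m_image_ctx == nullptr) {
      // The image was never opened (e.g. an earlier step failed), so
      // dereferencing m_image_ctx here would segfault; skip ahead.
      handle_close_image(0);
      return;
    }
    // The real code closes asynchronously and destroys the context in
    // the completion handler; collapsed into one step for brevity.
    m_image_ctx->destroy();
    m_image_ctx = nullptr;
    handle_close_image(0);
  }

  void handle_close_image(int r) {
    // Must likewise avoid touching m_image_ctx when it is null.
    std::cout << "close_image finished, r=" << r << "\n";
  }
};

int main() {
  TrashMoveRequest req;  // m_image_ctx is null: previously a segfault
  req.close_image();
}
```
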
This makes sure that all images are deleted in the existing qa scripts
and checks that all rbd-mirror metadata in OMAP are correctly deleted.

Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
(cherry picked from commit 4db66da)

In the LoadRequest in the ImageMap class, add initial cleanup to remove
stale entries. To clean up, the LoadRequest will query the mirror image
list and remove all the image_map entries that are not in the list.

Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
(cherry picked from commit e135403)
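
A sketch of the cleanup logic this commit message describes, with plain containers standing in for the OMAP image_map and the mirror image list; the container contents are illustrative.

```cpp
#include <iostream>
#include <map>
#include <set>
#include <string>

int main() {
  // image_map entries loaded from OMAP (global image id -> instance).
  std::map<std::string, std::string> image_map = {
      {"gid-1", "instance-a"}, {"gid-2", "instance-b"}, {"gid-3", "instance-a"}};

  // The current mirror image list queried from the pool.
  std::set<std::string> mirror_images = {"gid-1", "gid-3"};

  // Drop every mapping whose image is no longer in the mirror list.
  for (auto it = image_map.begin(); it != image_map.end();) {
    if (mirror_images.count(it->first) == 0) {
      std::cout << "removing stale image_map entry: " << it->first << "\n";
      it = image_map.erase(it);  // the real code queues an OMAP removal
    } else {
      ++it;
    }
  }
}
```
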
This prevents image_status_set from succeeding when there is no mirror
image yet. This solves some stale entries that were not removed in
rbd-mirror and prevents adding entries that would not be visible from
the rbd CLI.

Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
(cherry picked from commit 416e257)
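
A sketch of the guard this commit message describes; plain containers stand in for the cls_rbd OMAP state, and the surrounding error handling is a simplified assumption.

```cpp
#include <cerrno>
#include <iostream>
#include <set>
#include <string>

static std::set<std::string> mirror_images;  // images with a mirror record

int image_status_set(const std::string &global_image_id,
                     const std::string &status) {
  if (mirror_images.count(global_image_id) == 0) {
    // No mirror image record yet: refuse to create a status entry
    // that the rbd CLI would never show and nothing would clean up.
    return -ENOENT;
  }
  std::cout << "status[" << global_image_id << "] = " << status << "\n";
  return 0;
}

int main() {
  std::cout << image_status_set("gid-1", "up+replaying") << "\n";  // -ENOENT
  mirror_images.insert("gid-1");
  std::cout << image_status_set("gid-1", "up+replaying") << "\n";  // 0
}
```
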
In some split-brain scenarios the image is removed while image_mapped
is false. This prevents the removal of the image_map entry in OMAP, and
thus the entry will not be removed until the daemon is restarted.

Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
(cherry picked from commit 35398a5)

If the image is being removed, the PrepareRemoteImageRequest was
returning the same error whether the image was disabled or non-primary,
which doesn't allow the BootstrapRequest to have the correct error
handling.

This commit fixes this behavior by considering that the remote image is
already deleted if the image is in the disabling state.

Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
(cherry picked from commit ff60aec)
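
A sketch of the behavior change this commit message describes; the exact error codes used by the real PrepareRemoteImageRequest are not shown in this thread, so the codes below are assumptions.

```cpp
#include <cerrno>
#include <iostream>

enum class MirrorImageState { ENABLED, DISABLING, DISABLED };

int prepare_remote_image(MirrorImageState state, bool primary) {
  if (state == MirrorImageState::DISABLING) {
    // The image is being removed: treat the remote image as already
    // deleted instead of returning the generic "not primary" error,
    // so the caller can take its deletion path.
    return -ENOENT;
  }
  if (!primary) {
    return -EREMOTEIO;  // remote image exists but is not primary
  }
  return 0;
}

int main() {
  std::cout << prepare_remote_image(MirrorImageState::DISABLING, false) << "\n";
  std::cout << prepare_remote_image(MirrorImageState::ENABLED, false) << "\n";
  std::cout << prepare_remote_image(MirrorImageState::ENABLED, true) << "\n";
}
```
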
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
(cherry picked from commit 965bc41)
Conflicts:
        src/tools/rbd_mirror/image_replayer/PrepareLocalImageRequest.cc
- Trivial conflict resolution; s/librbd::asio:://

In some cases, set_state is called with DISSOCIATING, then ASSOCIATING,
and DISSOCIATING again. In this case the DISSOCIATING state is
processed to remove the image, and then the next scheduled action is
associating.

To fix this case, this commit removes the next_state if the state is
set to the same state.

Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
(cherry picked from commit b664a95)
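
A standalone sketch of the de-duplication this commit message describes, using hypothetical simplified types; the real policy state machine lives in rbd-mirror's image_map code.

```cpp
#include <iostream>
#include <optional>

enum class StateTransition { ASSOCIATING, DISSOCIATING };

struct Image {
  std::optional<StateTransition> state;       // transition in flight
  std::optional<StateTransition> next_state;  // queued transition

  void set_state(StateTransition s) {
    if (state == s) {
      // Re-requesting the in-flight state (DISSOCIATING ->
      // ASSOCIATING -> DISSOCIATING) must not leave a stale
      // next_state behind, or the removed image would immediately
      // be re-associated.
      next_state.reset();
      return;
    }
    if (!state) {
      state = s;       // nothing in flight: process immediately
    } else {
      next_state = s;  // queue behind the in-flight transition
    }
  }
};

int main() {
  Image img;
  img.set_state(StateTransition::DISSOCIATING);
  img.set_state(StateTransition::ASSOCIATING);   // queued
  img.set_state(StateTransition::DISSOCIATING);  // drops queued next_state
  std::cout << "next_state queued: " << img.next_state.has_value() << "\n";
}
```
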

Try fixing the sporadic failure in
TestMockMirrorStatusUpdater.RemoveImmediateUpdate linked in the tracker
by making it synchronous.

Fixes: https://tracker.ceph.com/issues/53375
Signed-off-by: Arthur Outhenin-Chalandre <arthur.outhenin-chalandre@cern.ch>
(cherry picked from commit 9385acf)

@MrFreezeex (Member Author)

> @MrFreezeex I think the change in your additional commit "rbd-mirror: fix TrashMoveRequest segfault" is correct, but I would add this change to the cherry-picked "rbd-mirror: remove mirror image at shut_down when there is no images" instead, where the issue was actually introduced, and add a "Conflicts" note there explaining this change, i.e. why we need it here and did not in master (because in master we do not explicitly delete the image context any more).

Done! I also rebased the PR; hopefully that will help with the make check failure in Jenkins.

For the record, here is the teuthology run before applying the fix: https://pulpito.ceph.com/mrfreezeex-2022-01-07_21:07:03-rbd:mirror-wip-53031-MrFreezeex-octopus-distro-default-smithi/ and after the fix: http://pulpito.front.sepia.ceph.com/mrfreezeex-2022-01-11_16:57:26-rbd:mirror-wip-53031-MrFreezeex-octopus-distro-default-smithi/ so it should work properly with octopus now!

@yuriw merged commit b849fa8 into ceph:octopus on Jan 20, 2022