-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
nautilus: rbd-mirror: image replayer stop might race with instance replayer shut down #41792
Conversation
This wraps the functionality of starting and finishing a tracked op into the standard context interface. Signed-off-by: Jason Dillaman <dillaman@redhat.com> (cherry picked from commit 4bd9d15)
Cherry-picking 64f8d9c was skipped because it adds changes (switch to common C_TrackedOp context class) to the code that does not exist in nautilus. |
test this please |
b2477b4
to
b070d57
Compare
The shut down waits for in-flight ops to complete but the start/stop/restart operations were previously not tracked. This could cause a potential race and crash between an image replayer operation and the instance replayer shutting down. Fixes: https://tracker.ceph.com/issues/45072 Signed-off-by: Jason Dillaman <dillaman@redhat.com> (cherry picked from commit 31140a9) Conflicts: src/tools/rbd_mirror/InstanceReplayer.cc: Mutex::Locker vs std::lock_guard, m_local_rados->cct() vs m_local_io_ctx.cct(), no stop(Context *on_finish) function.
b070d57
to
e4d083d
Compare
What about follow-up fixes in #34931? Are they needed? |
Sure. I was going to add them here after making sure this part passes jenkins test. |
13e73b0
to
423ecb1
Compare
The backport of #34931 is added. |
@@ -842,14 +853,14 @@ void ImageReplayer<I>::handle_replay_ready() | |||
template <typename I> | |||
void ImageReplayer<I>::restart(Context *on_finish) | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original commit adds setting of m_restart_requested
under m_lock
here. Why is it omitted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, sorry. I lost this when resolving the conflict. Should be ok now.
Previously, if stop was issued when restart was at "stopping" stage, the stop was just ignored. Signed-off-by: Mykola Golub <mgolub@suse.com> (cherry picked from commit 0a3794e) Conflicts: src/tools/rbd_mirror/ImageReplayer.cc (FunctionContext vs LambdaContext, update stop's args in handle_remote_journal_metadata_updated) src/tools/rbd_mirror/ImageReplayer.h (Mutex vs ceph::mutex)
when stopping instance replayer on shut down. Signed-off-by: Mykola Golub <mgolub@suse.com> (cherry picked from commit e55b64e) Conflicts: src/tools/rbd_mirror/InstanceReplayer.cc (no on_finish arg for stop())
423ecb1
to
55f88d6
Compare
@idryomov Thanks. Updated |
|
The hang looks related. It hanged in |
Actually, the hang may be not related (it might have been waiting for some other tracked op, not one that was added in this PR). But still it looks highly suspicious. And I have failed to track why it could get stuck so far. I am going to continue to investigate this tomorrow when I have time. But if we are shot in time (for release) I think we can just exclude this PR from the release (not closing, probably merging after the release). The initial issue does not look critical -- the assertion could fail on the rbd-mirror shutdown, so users probably not even notice it if they hit it. |
@trociny any updates on this ? |
I tracked the issue to the bug in So it is not a bug in the backport but it reveals a bug in nautilus. In the newer versions [1] https://github.com/ceph/ceph/blob/nautilus/src/tools/rbd_mirror/ImageReplayer.cc#L635 |
I believe it was accidentally fixed during refactoring in 0d36eb5, which we can't backport. So it should be a direct bug fix commit. |
@trociny I will start 14.2.22 testing and we will decide on this when you are ready, thx ! |
if (r < 0) { | ||
if (on_start_interrupted()) { | ||
return; | ||
} if (r < 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This happens to work because of return
, but better to change to else if
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah. Sure! Thanks!
jenkins test make check |
It fixes the bug when the handle_start_replay detected the cancel when it called on_replay_interrupted and returned without completing m_on_start_finish context. This is a direct commit to nautilus. The bug was accidentally fixed in newer versions during refactoring. Signed-off-by: Mykola Golub <mgolub@suse.com>
612f9ba
to
d47ddd9
Compare
Updated. Here are teuthology results for a test subset ( [1] https://pulpito.ceph.com/trociny-2021-06-17_16:39:29-rbd-wip-mgolub-testing-nautilus-distro-basic-smithi/ |
backport tracker: https://tracker.ceph.com/issues/45275
backport tracker: https://tracker.ceph.com/issues/45764
backport of #34615
parent tracker: https://tracker.ceph.com/issues/45072
backport of #3493
parent tracker: https://tracker.ceph.com/issues/45716
this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/master/src/script/ceph-backport.sh