
nautilus: rbd-mirror: image replayer stop might race with instance replayer shut down #41792

Merged
merged 5 commits from wip-45275-nautilus into ceph:nautilus on Jun 18, 2021

Conversation

@trociny commented Jun 9, 2021

backport tracker: https://tracker.ceph.com/issues/45275
backport tracker: https://tracker.ceph.com/issues/45764


backport of #34615
parent tracker: https://tracker.ceph.com/issues/45072

backport of #34931
parent tracker: https://tracker.ceph.com/issues/45716

this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/master/src/script/ceph-backport.sh

This wraps the functionality of starting and finishing a tracked op
into the standard context interface.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 4bd9d15)
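
As an illustration of what wrapping a tracked op into the standard context interface means here, the following is a minimal standalone sketch (the names OpTracker and C_TrackedOpSketch are invented for this example; this is not the PR's code): the wrapper registers an in-flight op when it is created and unregisters it once the wrapped completion has run, so a shutdown path can wait for all such contexts to complete.

    #include <condition_variable>
    #include <functional>
    #include <mutex>

    // Minimal stand-in for an async op tracker (hypothetical, for illustration).
    struct OpTracker {
      std::mutex lock;
      std::condition_variable cond;
      int in_flight = 0;

      void start_op() {
        std::lock_guard<std::mutex> l(lock);
        ++in_flight;
      }
      void finish_op() {
        std::lock_guard<std::mutex> l(lock);
        if (--in_flight == 0) {
          cond.notify_all();
        }
      }
      void wait_for_ops() {
        std::unique_lock<std::mutex> l(lock);
        cond.wait(l, [this] { return in_flight == 0; });
      }
    };

    // Context-style wrapper: registers a tracked op when constructed and
    // unregisters it after the wrapped completion has run.
    struct C_TrackedOpSketch {
      OpTracker &tracker;
      std::function<void(int)> on_finish;

      C_TrackedOpSketch(OpTracker &t, std::function<void(int)> f)
        : tracker(t), on_finish(std::move(f)) {
        tracker.start_op();
      }

      void complete(int r) {
        on_finish(r);
        tracker.finish_op();
        delete this;  // one-shot completion, like a ceph Context
      }
    };

A caller would allocate the wrapper with new and pass it wherever a completion callback is expected; wait_for_ops() then blocks until every outstanding wrapper has completed.
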
@trociny added this to the nautilus milestone Jun 9, 2021
@trociny added the rbd label Jun 9, 2021
@trociny commented Jun 9, 2021

Cherry-picking 64f8d9c was skipped because it adds changes (a switch to the common C_TrackedOp context class) to code that does not exist in nautilus.

@yuriw commented Jun 9, 2021

test this please

Shut down waits for in-flight ops to complete, but the start/stop/restart
operations were previously not tracked. This could cause a race, and
potentially a crash, between an image replayer operation and the instance
replayer shutting down.

Fixes: https://tracker.ceph.com/issues/45072
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 31140a9)

Conflicts:
	src/tools/rbd_mirror/InstanceReplayer.cc:
                Mutex::Locker vs std::lock_guard,
                m_local_rados->cct() vs m_local_io_ctx.cct(),
                no stop(Context *on_finish) function.
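
A rough sketch of the race being closed, assuming a simple counter-based tracker (standalone illustration with invented names, not the backported code): shut_down() waits for the in-flight count to drop to zero, so every start/stop/restart completion must be wrapped before it is dispatched, otherwise the operation can still be running against state that shut_down() is already tearing down.

    #include <condition_variable>
    #include <functional>
    #include <mutex>

    class InstanceReplayerSketch {
    public:
      // Wrap a start/stop/restart completion so the operation counts as
      // in-flight from dispatch until its completion callback has run.
      std::function<void(int)> track(std::function<void(int)> on_finish) {
        start_op();
        return [this, on_finish = std::move(on_finish)](int r) {
          on_finish(r);
          finish_op();
        };
      }

      // Shut down waits for every tracked in-flight op.  Before the fix, the
      // start/stop/restart completions were not wrapped this way, so shutdown
      // could proceed while one of them was still running -- the race the
      // commit message describes.
      void shut_down() {
        std::unique_lock<std::mutex> l(m_lock);
        m_cond.wait(l, [this] { return m_in_flight == 0; });
        // ... now it is safe to tear down image replayers, contexts, etc.
      }

    private:
      void start_op() {
        std::lock_guard<std::mutex> l(m_lock);
        ++m_in_flight;
      }
      void finish_op() {
        std::lock_guard<std::mutex> l(m_lock);
        if (--m_in_flight == 0) {
          m_cond.notify_all();
        }
      }

      std::mutex m_lock;
      std::condition_variable m_cond;
      int m_in_flight = 0;
    };
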
@idryomov commented Jun 9, 2021

What about follow-up fixes in #34931? Are they needed?

@trociny commented Jun 10, 2021

What about follow-up fixes in #34931? Are they needed?

Sure. I was going to add them here after making sure this part passes the jenkins tests.

@trociny force-pushed the wip-45275-nautilus branch 2 times, most recently from 13e73b0 to 423ecb1 on June 10, 2021 06:06
@trociny commented Jun 10, 2021

The backport of #34931 is added.

src/tools/rbd_mirror/ImageReplayer.cc (outdated review thread, resolved)
@@ -842,14 +853,14 @@ void ImageReplayer<I>::handle_replay_ready()
template <typename I>
void ImageReplayer<I>::restart(Context *on_finish)
{
Reviewer comment (Contributor):

The original commit adds setting of m_restart_requested under m_lock here. Why is it omitted?

@trociny (author) replied:

Oh, sorry. I lost this when resolving the conflict. Should be ok now.

Previously, if a stop was issued while a restart was at the "stopping"
stage, the stop was just ignored.

Signed-off-by: Mykola Golub <mgolub@suse.com>
(cherry picked from commit 0a3794e)

Conflicts:
	src/tools/rbd_mirror/ImageReplayer.cc (FunctionContext vs LambdaContext,
	                                       update stop's args in handle_remote_journal_metadata_updated)
	src/tools/rbd_mirror/ImageReplayer.h (Mutex vs ceph::mutex)
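
Putting the two pieces together -- the m_restart_requested flag set under m_lock (discussed in the review thread above) and a stop no longer being ignored while a restart is in its "stopping" stage -- a standalone sketch of the intended behaviour might look as follows. The member names mirror ImageReplayer, but the bodies are assumptions for illustration, not the cherry-picked code.

    #include <functional>
    #include <mutex>

    class ImageReplayerSketch {
    public:
      // restart = stop, then start again once the stop has completed
      void restart(std::function<void(int)> on_finish) {
        {
          std::lock_guard<std::mutex> locker(m_lock);
          m_restart_requested = true;  // recorded under m_lock
        }
        stop([this, on_finish](int r) {
          bool restart_pending;
          {
            std::lock_guard<std::mutex> locker(m_lock);
            restart_pending = m_restart_requested;
          }
          if (r < 0 || !restart_pending) {
            // restart failed or was canceled by a manual stop
            // (real code would likely report an error such as -ECANCELED)
            on_finish(r);
            return;
          }
          start(on_finish);
        }, /*manual=*/false);
      }

      // A manual stop issued while a restart is in its "stopping" stage is
      // no longer ignored: it cancels the pending restart.
      void stop(std::function<void(int)> on_finish, bool manual = true) {
        {
          std::lock_guard<std::mutex> locker(m_lock);
          if (manual) {
            m_restart_requested = false;
          }
        }
        // ... stop replay asynchronously, then:
        on_finish(0);
      }

    private:
      void start(std::function<void(int)> on_finish) { on_finish(0); }

      std::mutex m_lock;
      bool m_restart_requested = false;
    };
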
when stopping instance replayer on shut down.

Signed-off-by: Mykola Golub <mgolub@suse.com>
(cherry picked from commit e55b64e)

Conflicts:
	src/tools/rbd_mirror/InstanceReplayer.cc (no on_finish arg for stop())
@trociny commented Jun 10, 2021

@idryomov Thanks. Updated

@idryomov commented Jun 15, 2021

http://qa-proxy.ceph.com/teuthology/yuriw-2021-06-14_16:00:31-rbd-wip-yuri-testing-2021-06-14-0729-nautilus-distro-basic-smithi/6171891/teuthology.log

2021-06-14T20:42:46.208 INFO:tasks.rbd_mirror_thrash:kill cluster2.client.mirror.2
2021-06-14T20:42:46.209 INFO:tasks.rbd_mirror.cluster2.client.mirror.2:Sent signal 15
2021-06-14T20:42:46.209 INFO:tasks.rbd_mirror_thrash:waiting for 3 secs before reviving daemons
2021-06-14T20:42:46.210 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr:2021-06-14 20:42:46.210 7f6966a82700 -1 received  signal: Terminated from /usr/bin/python3 /usr/bin/daemon-helper term rbd-mirror --foreground --cluster cluster2 --id mirror.2  (PID: 261655) UID: 1000
2021-06-14T20:42:49.210 INFO:tasks.rbd_mirror_thrash:waiting for cluster2.client.mirror.2
2021-06-14T20:42:49.211 INFO:teuthology.orchestra.run:waiting for 600

...

2021-06-14T20:52:53.397 INFO:tasks.rbd_mirror_thrash:Failed to stop cluster2.client.mirror.2
2021-06-14T20:52:53.398 INFO:tasks.rbd_mirror.cluster2.client.mirror.2:Sent signal 6
2021-06-14T20:52:53.398 ERROR:tasks.rbd_mirror_thrash:exception:
Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_efaec7927e949dc1e9e7f068e4f86265596ffab6/qa/tasks/rbd_mirror_thrash.py", line 84, in _run
    self.do_thrash()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_efaec7927e949dc1e9e7f068e4f86265596ffab6/qa/tasks/rbd_mirror_thrash.py", line 153, in do_thrash
    run.wait([daemon.proc], timeout=600)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_f359b10daba6e0103d42ccfc021bc797f3cd7edc/teuthology/orchestra/run.py", line 473, in wait
    check_time()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_f359b10daba6e0103d42ccfc021bc797f3cd7edc/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: reached maximum tries (100) after waiting for 600 seconds
2021-06-14T20:52:53.399 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr:*** Caught signal (Aborted) **
2021-06-14T20:52:53.399 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr: in thread 7f69788171c0 thread_name:rbd-mirror
2021-06-14T20:52:53.425 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr: ceph version 14.2.21-364-gefaec7927e9 (efaec7927e949dc1e9e7f068e4f86265596ffab6) nautilus (stable)
2021-06-14T20:52:53.426 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr: 1: (()+0xf630) [0x7f696d7f1630]
2021-06-14T20:52:53.426 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr: 2: (pthread_cond_wait()+0xc5) [0x7f696d7eda35]
2021-06-14T20:52:53.426 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr: 3: (rbd::mirror::InstanceReplayer<librbd::ImageCtx>::shut_down()+0x197) [0x55d59d86ec77]
2021-06-14T20:52:53.427 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr: 4: (rbd::mirror::PoolReplayer<librbd::ImageCtx>::shut_down()+0x96) [0x55d59d830a06]
2021-06-14T20:52:53.427 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr: 5: (rbd::mirror::PoolReplayer<librbd::ImageCtx>::~PoolReplayer()+0xa4) [0x55d59d834cd4]
2021-06-14T20:52:53.427 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr: 6: (std::_Rb_tree<std::pair<long, rbd::mirror::PeerSpec>, std::pair<std::pair<long, rbd::mirror::PeerSpec> const, std::unique_ptr<rbd::mirror::PoolReplayer<librbd::ImageCtx>, std::default_delete<rbd::mirror::PoolReplayer<librbd::ImageCtx> > > >, std::_Select1st<std::pair<std::pair<long, rbd::mirror::PeerSpec> const, std::unique_ptr<rbd::mirror::PoolReplayer<librbd::ImageCtx>, std::default_delete<rbd::mirror::PoolReplayer<librbd::ImageCtx> > > > >, std::less<std::pair<long, rbd::mirror::PeerSpec> >, std::allocator<std::pair<std::pair<long, rbd::mirror::PeerSpec> const, std::unique_ptr<rbd::mirror::PoolReplayer<librbd::ImageCtx>, std::default_delete<rbd::mirror::PoolReplayer<librbd::ImageCtx> > > > > >::_M_erase(std::_Rb_tree_node<std::pair<std::pair<long, rbd::mirror::PeerSpec> const, std::unique_ptr<rbd::mirror::PoolReplayer<librbd::ImageCtx>, std::default_delete<rbd::mirror::PoolReplayer<librbd::ImageCtx> > > > >*)+0x3f) [0x55d59d82bfef]
2021-06-14T20:52:53.427 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr: 7: (rbd::mirror::Mirror::~Mirror()+0xaf) [0x55d59d8278df]
2021-06-14T20:52:53.428 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr: 8: (main()+0x3a8) [0x55d59d8171c8]
2021-06-14T20:52:53.428 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr: 9: (__libc_start_main()+0xf5) [0x7f696bd94555]
2021-06-14T20:52:53.428 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr: 10: (()+0x215090) [0x55d59d824090]
2021-06-14T20:52:53.430 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr:2021-06-14 20:52:53.430 7f69788171c0 -1 *** Caught signal (Aborted) **
2021-06-14T20:52:53.430 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr: in thread 7f69788171c0 thread_name:rbd-mirror

@idryomov commented:
I commented here thinking rbd-mirror crashed during shutdown, but it actually hung and was aborted on purpose. The rerun succeeded, as expected.

@trociny Can you take a look to make sure the hang is not related? Apart from this PR, #41787 and #41788 were included.

@trociny commented Jun 15, 2021

The hang looks related. It hung in InstanceReplayer on shutdown, "waiting for in-flight start/stop/restart", i.e. exactly in the backported code. Right now I have no idea about the cause, but I am still looking.

@trociny commented Jun 15, 2021

The hang looks related. It hung in InstanceReplayer on shutdown, "waiting for in-flight start/stop/restart", i.e. exactly in the backported code. Right now I have no idea about the cause, but I am still looking.

Actually, the hang may not be related (it might have been waiting for some other tracked op, not one added in this PR). But it still looks highly suspicious, and so far I have failed to track down why it could get stuck.

I am going to continue investigating tomorrow when I have time. But if we are short on time (for the release), I think we can just exclude this PR from the release (not closing it, probably merging it after the release). The initial issue does not look critical -- the assertion could fail on rbd-mirror shutdown, so users would probably not even notice it if they hit it.

@yuriw commented Jun 17, 2021

@trociny any updates on this?

@trociny commented Jun 17, 2021

I tracked the issue down to a bug in ImageReplayer<I>::handle_start_replay. If we cancel an image replay start while it is in start_replay, then in handle_start_replay the on_replay_interrupted() call [2] will return true and we just return, without completing the m_on_start_finish context, so this "start" remains pending among the tracked operations and on shut down we wait for it to complete forever.

So it is not a bug in the backport, but it reveals a bug in nautilus. In newer versions ImageReplayer<I>::handle_start_replay does not have this bug (we call on_start_interrupted() as the first step, which handles this case). Now I am trying to figure out whether there is a commit I can just cherry-pick or whether it should be a direct fix to nautilus.

[1] https://github.com/ceph/ceph/blob/nautilus/src/tools/rbd_mirror/ImageReplayer.cc#L635
[2] https://github.com/ceph/ceph/blob/nautilus/src/tools/rbd_mirror/ImageReplayer.cc#L664
[3] https://github.com/ceph/ceph/blob/master/src/tools/rbd_mirror/ImageReplayer.cc#L427
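
A simplified, standalone reconstruction of the problem described above and of the shape of the fix (names mirror ImageReplayer, but the bodies are assumptions for illustration, not the nautilus or master code):

    #include <cerrno>
    #include <functional>

    struct StartFlowSketch {
      std::function<void(int)> m_on_start_finish;  // the tracked "start" context
      bool m_stop_requested = false;               // set if the start was canceled

      bool on_start_interrupted() {
        if (!m_stop_requested) {
          return false;
        }
        finish_start(-ECANCELED);                  // completes the tracked op
        return true;
      }

      // Fixed flow (as in newer releases): check for an interrupted start
      // first; every branch below completes m_on_start_finish one way or
      // another, so the tracked op can never be left pending.
      void handle_start_replay(int r) {
        if (on_start_interrupted()) {
          return;
        } else if (r < 0) {
          finish_start(r);                         // start failed
          return;
        }
        finish_start(0);                           // replay started
      }

      // The nautilus bug corresponds to a variant of handle_start_replay()
      // that detected the interruption via on_replay_interrupted() and
      // returned *without* completing m_on_start_finish, so
      // InstanceReplayer::shut_down() waited for that "start" forever.

      void finish_start(int r) {
        if (m_on_start_finish) {
          auto ctx = std::move(m_on_start_finish);
          m_on_start_finish = nullptr;
          ctx(r);
        }
      }
    };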

@trociny commented Jun 17, 2021

I believe it was accidentally fixed during refactoring in 0d36eb5, which we can't backport. So it should be a direct bug fix commit.

@yuriw commented Jun 17, 2021

@trociny I will start 14.2.22 testing and we will decide on this when you are ready, thx!

if (r < 0) {
if (on_start_interrupted()) {
return;
} if (r < 0) {
Reviewer comment (Contributor):

This happens to work because of the return, but it would be better to change it to else if.

@trociny (author) replied:

Ah. Sure! Thanks!
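
For clarity, the restructuring being agreed on here (inferred from the snippet and the comment above, not copied from the final diff) would read:

    if (on_start_interrupted()) {
      return;
    } else if (r < 0) {
      // handle the start_replay error as before
    }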

@idryomov commented:
jenkins test make check

This fixes the bug where handle_start_replay detected the cancel via
on_replay_interrupted and returned without completing the
m_on_start_finish context.

This is a direct commit to nautilus. The bug was accidentally
fixed in newer versions during refactoring.

Signed-off-by: Mykola Golub <mgolub@suse.com>
@trociny commented Jun 18, 2021

Updated.

Here are the teuthology results for a test subset (--filter rbd/mirror-thrash --limit 20) [1] (no failures). They are with the previous patch, though that should not make any difference. For comparison, here are the results for the same subset without the patch [2] (some tests failed).

[1] https://pulpito.ceph.com/trociny-2021-06-17_16:39:29-rbd-wip-mgolub-testing-nautilus-distro-basic-smithi/
[2] https://pulpito.ceph.com/trociny-2021-06-17_08:21:42-rbd-wip-mgolub-testing-nautilus-distro-basic-smithi/

@yuriw merged commit 22c2801 into ceph:nautilus Jun 18, 2021