
nautilus: rbd-mirror: image replayer stop might race with instance replayer shut down #41792

Merged
merged 5 commits from wip-45275-nautilus into ceph:nautilus on Jun 18, 2021

Conversation

@trociny commented Jun 9, 2021

backport tracker: https://tracker.ceph.com/issues/45275
backport tracker: https://tracker.ceph.com/issues/45764


backport of #34615
parent tracker: https://tracker.ceph.com/issues/45072

backport of #34931
parent tracker: https://tracker.ceph.com/issues/45716

this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/master/src/script/ceph-backport.sh

This wraps the functionality of starting and finishing a tracked op
into the standard context interface.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 4bd9d15)
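
As an illustration of what wrapping a tracked op into the standard context interface means here, the following is a minimal standalone sketch (the names OpTracker and C_TrackedOpSketch are invented for this example; this is not the PR's code): the wrapper registers an in-flight op when it is created and unregisters it once the wrapped completion has run, so a shutdown path can wait for all such contexts to complete.

    #include <condition_variable>
    #include <functional>
    #include <mutex>

    // Minimal stand-in for an async op tracker (hypothetical, for illustration).
    struct OpTracker {
      std::mutex lock;
      std::condition_variable cond;
      int in_flight = 0;

      void start_op() {
        std::lock_guard<std::mutex> l(lock);
        ++in_flight;
      }
      void finish_op() {
        std::lock_guard<std::mutex> l(lock);
        if (--in_flight == 0) {
          cond.notify_all();
        }
      }
      void wait_for_ops() {
        std::unique_lock<std::mutex> l(lock);
        cond.wait(l, [this] { return in_flight == 0; });
      }
    };

    // Context-style wrapper: registers a tracked op when constructed and
    // unregisters it after the wrapped completion has run.
    struct C_TrackedOpSketch {
      OpTracker &tracker;
      std::function<void(int)> on_finish;

      C_TrackedOpSketch(OpTracker &t, std::function<void(int)> f)
        : tracker(t), on_finish(std::move(f)) {
        tracker.start_op();
      }

      void complete(int r) {
        on_finish(r);
        tracker.finish_op();
        delete this;  // one-shot completion, like a ceph Context
      }
    };

A caller would allocate the wrapper with new and pass it wherever a completion callback is expected; wait_for_ops() then blocks until every outstanding wrapper has completed.
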
@trociny added this to the nautilus milestone Jun 9, 2021
@trociny added the rbd label Jun 9, 2021
@trociny commented Jun 9, 2021

Cherry-picking 64f8d9c was skipped because it adds changes (a switch to the common C_TrackedOp context class) to code that does not exist in nautilus.

@yuriw commented Jun 9, 2021

test this please

Shut down waits for in-flight ops to complete, but the start/stop/restart
operations were previously not tracked. This could cause a race, and
potentially a crash, between an image replayer operation and the instance
replayer shutting down.

Fixes: https://tracker.ceph.com/issues/45072
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 31140a9)

Conflicts:
	src/tools/rbd_mirror/InstanceReplayer.cc:
                Mutex::Locker vs std::lock_guard,
                m_local_rados->cct() vs m_local_io_ctx.cct(),
                no stop(Context *on_finish) function.
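
A rough sketch of the race being closed, assuming a simple counter-based tracker (standalone illustration with invented names, not the backported code): shut_down() waits for the in-flight count to drop to zero, so every start/stop/restart completion must be wrapped before it is dispatched, otherwise the operation can still be running against state that shut_down() is already tearing down.

    #include <condition_variable>
    #include <functional>
    #include <mutex>

    class InstanceReplayerSketch {
    public:
      // Wrap a start/stop/restart completion so the operation counts as
      // in-flight from dispatch until its completion callback has run.
      std::function<void(int)> track(std::function<void(int)> on_finish) {
        start_op();
        return [this, on_finish = std::move(on_finish)](int r) {
          on_finish(r);
          finish_op();
        };
      }

      // Shut down waits for every tracked in-flight op.  Before the fix, the
      // start/stop/restart completions were not wrapped this way, so shutdown
      // could proceed while one of them was still running -- the race the
      // commit message describes.
      void shut_down() {
        std::unique_lock<std::mutex> l(m_lock);
        m_cond.wait(l, [this] { return m_in_flight == 0; });
        // ... now it is safe to tear down image replayers, contexts, etc.
      }

    private:
      void start_op() {
        std::lock_guard<std::mutex> l(m_lock);
        ++m_in_flight;
      }
      void finish_op() {
        std::lock_guard<std::mutex> l(m_lock);
        if (--m_in_flight == 0) {
          m_cond.notify_all();
        }
      }

      std::mutex m_lock;
      std::condition_variable m_cond;
      int m_in_flight = 0;
    };
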
@idryomov commented Jun 9, 2021

What about follow-up fixes in #34931? Are they needed?

@trociny commented Jun 10, 2021

What about follow-up fixes in #34931? Are they needed?

Sure. I was going to add them here after making sure this part passes the jenkins tests.

@trociny force-pushed the wip-45275-nautilus branch 2 times, most recently from 13e73b0 to 423ecb1 on June 10, 2021 06:06
@trociny commented Jun 10, 2021

The backport of #34931 is added.

src/tools/rbd_mirror/ImageReplayer.cc (outdated review thread, resolved)
@@ -842,14 +853,14 @@ void ImageReplayer<I>::handle_replay_ready()
template <typename I>
void ImageReplayer<I>::restart(Context *on_finish)
{
Reviewer comment (Contributor):

The original commit adds setting of m_restart_requested under m_lock here. Why is it omitted?

@trociny (author) replied:

Oh, sorry. I lost this when resolving the conflict. Should be ok now.

Previously, if a stop was issued while a restart was at the "stopping"
stage, the stop was just ignored.

Signed-off-by: Mykola Golub <mgolub@suse.com>
(cherry picked from commit 0a3794e)

Conflicts:
	src/tools/rbd_mirror/ImageReplayer.cc (FunctionContext vs LambdaContext,
	                                       update stop's args in handle_remote_journal_metadata_updated)
	src/tools/rbd_mirror/ImageReplayer.h (Mutex vs ceph::mutex)
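
Putting the two pieces together -- the m_restart_requested flag set under m_lock (discussed in the review thread above) and a stop no longer being ignored while a restart is in its "stopping" stage -- a standalone sketch of the intended behaviour might look as follows. The member names mirror ImageReplayer, but the bodies are assumptions for illustration, not the cherry-picked code.

    #include <functional>
    #include <mutex>

    class ImageReplayerSketch {
    public:
      // restart = stop, then start again once the stop has completed
      void restart(std::function<void(int)> on_finish) {
        {
          std::lock_guard<std::mutex> locker(m_lock);
          m_restart_requested = true;  // recorded under m_lock
        }
        stop([this, on_finish](int r) {
          bool restart_pending;
          {
            std::lock_guard<std::mutex> locker(m_lock);
            restart_pending = m_restart_requested;
          }
          if (r < 0 || !restart_pending) {
            // restart failed or was canceled by a manual stop
            // (real code would likely report an error such as -ECANCELED)
            on_finish(r);
            return;
          }
          start(on_finish);
        }, /*manual=*/false);
      }

      // A manual stop issued while a restart is in its "stopping" stage is
      // no longer ignored: it cancels the pending restart.
      void stop(std::function<void(int)> on_finish, bool manual = true) {
        {
          std::lock_guard<std::mutex> locker(m_lock);
          if (manual) {
            m_restart_requested = false;
          }
        }
        // ... stop replay asynchronously, then:
        on_finish(0);
      }

    private:
      void start(std::function<void(int)> on_finish) { on_finish(0); }

      std::mutex m_lock;
      bool m_restart_requested = false;
    };
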
when stopping instance replayer on shut down.

Signed-off-by: Mykola Golub <mgolub@suse.com>
(cherry picked from commit e55b64e)

Conflicts:
	src/tools/rbd_mirror/InstanceReplayer.cc (no on_finish arg for stop())
@trociny commented Jun 10, 2021

@idryomov Thanks. Updated

@idryomov commented Jun 15, 2021

http://qa-proxy.ceph.com/teuthology/yuriw-2021-06-14_16:00:31-rbd-wip-yuri-testing-2021-06-14-0729-nautilus-distro-basic-smithi/6171891/teuthology.log

2021-06-14T20:42:46.208 INFO:tasks.rbd_mirror_thrash:kill cluster2.client.mirror.2
2021-06-14T20:42:46.209 INFO:tasks.rbd_mirror.cluster2.client.mirror.2:Sent signal 15
2021-06-14T20:42:46.209 INFO:tasks.rbd_mirror_thrash:waiting for 3 secs before reviving daemons
2021-06-14T20:42:46.210 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr:2021-06-14 20:42:46.210 7f6966a82700 -1 received  signal: Terminated from /usr/bin/python3 /usr/bin/daemon-helper term rbd-mirror --foreground --cluster cluster2 --id mirror.2  (PID: 261655) UID: 1000
2021-06-14T20:42:49.210 INFO:tasks.rbd_mirror_thrash:waiting for cluster2.client.mirror.2
2021-06-14T20:42:49.211 INFO:teuthology.orchestra.run:waiting for 600

...

2021-06-14T20:52:53.397 INFO:tasks.rbd_mirror_thrash:Failed to stop cluster2.client.mirror.2
2021-06-14T20:52:53.398 INFO:tasks.rbd_mirror.cluster2.client.mirror.2:Sent signal 6
2021-06-14T20:52:53.398 ERROR:tasks.rbd_mirror_thrash:exception:
Traceback (most recent call last):
  File "/home/teuthworker/src/github.com_ceph_ceph-c_efaec7927e949dc1e9e7f068e4f86265596ffab6/qa/tasks/rbd_mirror_thrash.py", line 84, in _run
    self.do_thrash()
  File "/home/teuthworker/src/github.com_ceph_ceph-c_efaec7927e949dc1e9e7f068e4f86265596ffab6/qa/tasks/rbd_mirror_thrash.py", line 153, in do_thrash
    run.wait([daemon.proc], timeout=600)
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_f359b10daba6e0103d42ccfc021bc797f3cd7edc/teuthology/orchestra/run.py", line 473, in wait
    check_time()
  File "/home/teuthworker/src/git.ceph.com_git_teuthology_f359b10daba6e0103d42ccfc021bc797f3cd7edc/teuthology/contextutil.py", line 133, in __call__
    raise MaxWhileTries(error_msg)
teuthology.exceptions.MaxWhileTries: reached maximum tries (100) after waiting for 600 seconds
2021-06-14T20:52:53.399 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr:*** Caught signal (Aborted) **
2021-06-14T20:52:53.399 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr: in thread 7f69788171c0 thread_name:rbd-mirror
2021-06-14T20:52:53.425 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr: ceph version 14.2.21-364-gefaec7927e9 (efaec7927e949dc1e9e7f068e4f86265596ffab6) nautilus (stable)
2021-06-14T20:52:53.426 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr: 1: (()+0xf630) [0x7f696d7f1630]
2021-06-14T20:52:53.426 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr: 2: (pthread_cond_wait()+0xc5) [0x7f696d7eda35]
2021-06-14T20:52:53.426 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr: 3: (rbd::mirror::InstanceReplayer<librbd::ImageCtx>::shut_down()+0x197) [0x55d59d86ec77]
2021-06-14T20:52:53.427 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr: 4: (rbd::mirror::PoolReplayer<librbd::ImageCtx>::shut_down()+0x96) [0x55d59d830a06]
2021-06-14T20:52:53.427 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr: 5: (rbd::mirror::PoolReplayer<librbd::ImageCtx>::~PoolReplayer()+0xa4) [0x55d59d834cd4]
2021-06-14T20:52:53.427 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr: 6: (std::_Rb_tree<std::pair<long, rbd::mirror::PeerSpec>, std::pair<std::pair<long, rbd::mirror::PeerSpec> const, std::unique_ptr<rbd::mirror::PoolReplayer<librbd::ImageCtx>, std::default_delete<rbd::mirror::PoolReplayer<librbd::ImageCtx> > > >, std::_Select1st<std::pair<std::pair<long, rbd::mirror::PeerSpec> const, std::unique_ptr<rbd::mirror::PoolReplayer<librbd::ImageCtx>, std::default_delete<rbd::mirror::PoolReplayer<librbd::ImageCtx> > > > >, std::less<std::pair<long, rbd::mirror::PeerSpec> >, std::allocator<std::pair<std::pair<long, rbd::mirror::PeerSpec> const, std::unique_ptr<rbd::mirror::PoolReplayer<librbd::ImageCtx>, std::default_delete<rbd::mirror::PoolReplayer<librbd::ImageCtx> > > > > >::_M_erase(std::_Rb_tree_node<std::pair<std::pair<long, rbd::mirror::PeerSpec> const, std::unique_ptr<rbd::mirror::PoolReplayer<librbd::ImageCtx>, std::default_delete<rbd::mirror::PoolReplayer<librbd::ImageCtx> > > > >*)+0x3f) [0x55d59d82bfef]
2021-06-14T20:52:53.427 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr: 7: (rbd::mirror::Mirror::~Mirror()+0xaf) [0x55d59d8278df]
2021-06-14T20:52:53.428 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr: 8: (main()+0x3a8) [0x55d59d8171c8]
2021-06-14T20:52:53.428 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr: 9: (__libc_start_main()+0xf5) [0x7f696bd94555]
2021-06-14T20:52:53.428 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr: 10: (()+0x215090) [0x55d59d824090]
2021-06-14T20:52:53.430 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr:2021-06-14 20:52:53.430 7f69788171c0 -1 *** Caught signal (Aborted) **
2021-06-14T20:52:53.430 INFO:tasks.rbd_mirror.cluster2.client.mirror.2.smithi174.stderr: in thread 7f69788171c0 thread_name:rbd-mirror

@idryomov commented:
I commented here thinking rbd-mirror crashed during shutdown, but it actually hung and was aborted on purpose. The rerun succeeded, as expected.

@trociny Can you take a look to make sure the hang is not related? Apart from this PR, #41787 and #41788 were included.

@trociny commented Jun 15, 2021

The hang looks related. It hung in InstanceReplayer on shutdown, "waiting for in-flight start/stop/restart", i.e. exactly in the backported code. Right now I have no idea about the cause, but I am still looking.

@trociny commented Jun 15, 2021

The hang looks related. It hung in InstanceReplayer on shutdown, "waiting for in-flight start/stop/restart", i.e. exactly in the backported code. Right now I have no idea about the cause, but I am still looking.

Actually, the hang may not be related (it might have been waiting for some other tracked op, not one added in this PR). But it still looks highly suspicious, and so far I have failed to track down why it could get stuck.

I am going to continue investigating tomorrow when I have time. But if we are short on time (for the release), I think we can just exclude this PR from the release (not closing it, probably merging it after the release). The initial issue does not look critical -- the assertion could fail on rbd-mirror shutdown, so users would probably not even notice it if they hit it.

@yuriw commented Jun 17, 2021

@trociny any updates on this?

@trociny commented Jun 17, 2021

I tracked the issue down to a bug in ImageReplayer<I>::handle_start_replay. If we cancel an image replay start while it is in start_replay, then in handle_start_replay the on_replay_interrupted() call [2] will return true and we just return, without completing the m_on_start_finish context, so this "start" remains pending among the tracked operations and on shut down we wait for it to complete forever.

So it is not a bug in the backport, but it reveals a bug in nautilus. In newer versions ImageReplayer<I>::handle_start_replay does not have this bug (we call on_start_interrupted() as the first step, which handles this case). Now I am trying to figure out whether there is a commit I can just cherry-pick or whether it should be a direct fix to nautilus.

[1] https://github.com/ceph/ceph/blob/nautilus/src/tools/rbd_mirror/ImageReplayer.cc#L635
[2] https://github.com/ceph/ceph/blob/nautilus/src/tools/rbd_mirror/ImageReplayer.cc#L664
[3] https://github.com/ceph/ceph/blob/master/src/tools/rbd_mirror/ImageReplayer.cc#L427
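
A simplified, standalone reconstruction of the problem described above and of the shape of the fix (names mirror ImageReplayer, but the bodies are assumptions for illustration, not the nautilus or master code):

    #include <cerrno>
    #include <functional>

    struct StartFlowSketch {
      std::function<void(int)> m_on_start_finish;  // the tracked "start" context
      bool m_stop_requested = false;               // set if the start was canceled

      bool on_start_interrupted() {
        if (!m_stop_requested) {
          return false;
        }
        finish_start(-ECANCELED);                  // completes the tracked op
        return true;
      }

      // Fixed flow (as in newer releases): check for an interrupted start
      // first; every branch below completes m_on_start_finish one way or
      // another, so the tracked op can never be left pending.
      void handle_start_replay(int r) {
        if (on_start_interrupted()) {
          return;
        } else if (r < 0) {
          finish_start(r);                         // start failed
          return;
        }
        finish_start(0);                           // replay started
      }

      // The nautilus bug corresponds to a variant of handle_start_replay()
      // that detected the interruption via on_replay_interrupted() and
      // returned *without* completing m_on_start_finish, so
      // InstanceReplayer::shut_down() waited for that "start" forever.

      void finish_start(int r) {
        if (m_on_start_finish) {
          auto ctx = std::move(m_on_start_finish);
          m_on_start_finish = nullptr;
          ctx(r);
        }
      }
    };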

@trociny commented Jun 17, 2021

I believe it was accidentally fixed during refactoring in 0d36eb5, which we can't backport. So it should be a direct bug fix commit.

@yuriw commented Jun 17, 2021

@trociny I will start 14.2.22 testing and we will decide on this when you are ready, thx!

if (r < 0) {
if (on_start_interrupted()) {
return;
} if (r < 0) {
Reviewer comment (Contributor):

This happens to work because of the return, but it would be better to change it to else if.

@trociny (author) replied:

Ah. Sure! Thanks!
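
For clarity, the restructuring being agreed on here (inferred from the snippet and the comment above, not copied from the final diff) would read:

    if (on_start_interrupted()) {
      return;
    } else if (r < 0) {
      // handle the start_replay error as before
    }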

@idryomov commented:
jenkins test make check

This fixes the bug where handle_start_replay detected the cancel via
on_replay_interrupted and returned without completing the
m_on_start_finish context.

This is a direct commit to nautilus. The bug was accidentally
fixed in newer versions during refactoring.

Signed-off-by: Mykola Golub <mgolub@suse.com>
@trociny commented Jun 18, 2021

Updated.

Here are the teuthology results for a test subset (--filter rbd/mirror-thrash --limit 20) [1] (no failures). They are with the previous patch, though that should not make any difference. For comparison, here are the results for the same subset without the patch [2] (some tests failed).

[1] https://pulpito.ceph.com/trociny-2021-06-17_16:39:29-rbd-wip-mgolub-testing-nautilus-distro-basic-smithi/
[2] https://pulpito.ceph.com/trociny-2021-06-17_08:21:42-rbd-wip-mgolub-testing-nautilus-distro-basic-smithi/

@yuriw merged commit 22c2801 into ceph:nautilus Jun 18, 2021