nautilus: msg/async: connection race + winner fault can leave connection stuck at replacing forever #27915

pdvian · 2019-05-02T03:57:53Z

http://tracker.ceph.com/issues/39241

pdvian · 2019-05-02T04:00:33Z

src/msg/simple/Pipe.cc

@@ -164,7 +164,7 @@ Pipe::Pipe(SimpleMessenger *r, int st, PipeConnection *con)

  randomize_out_seq();

-  msgr->timeout = msgr->cct->_conf->ms_tcp_read_timeout * 1000; //convert to ms
+  msgr->timeout = msgr->cct->_conf->ms_connection_idle_timeout * 1000; //convert to ms


@xiexingguo Please review this hunk. The Pipe constructor still referred to the ms_tcp_read_timeout config option, so the reference had to be renamed to ms_connection_idle_timeout.

dillaman

Please pull in PR #28050 as well to avoid breaking RBD tests

xiexingguo · 2019-06-04T08:53:47Z

@pdvian Can you cherry-pick in 6b4f972 too? Thanks!

pdvian · 2019-06-05T00:58:51Z

@dillaman @xiexingguo Sure. Let me work on it.

The old naming is confusing: the option actually indicates that we should tear down an underlying connection that has had no read/write activity on either side (that is, the connection is idle) for over 15 minutes.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
(cherry picked from commit 1d46422)

Conflicts:
src/common/legacy_config_opts.h : Resolved for ms_connection_idle_timeout
src/common/options.cc : Resolved for ms_connection_idle_timeout
src/msg/simple/Pipe.cc : rename ms_tcp_read_timeout to ms_connection_idle_timeout
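To make the idle-timeout semantics concrete, here is a minimal sketch of tearing down a connection that has seen no read/write activity for longer than a configured interval. `IdleTracker` and its methods are hypothetical names for illustration only and are not part of Ceph; the 15-minute value comes from the commit message, and `timeout_ms()` mirrors the `* 1000` seconds-to-milliseconds conversion in the Pipe.cc hunk.

```cpp
#include <chrono>

// Hypothetical helper for illustration; this is NOT a Ceph class.
class IdleTracker {
 public:
  using Clock = std::chrono::steady_clock;

  IdleTracker(std::chrono::seconds idle_timeout, Clock::time_point now)
      : idle_timeout_(idle_timeout), last_active_(now) {}

  // Call after every successful read or write on the connection.
  void touch(Clock::time_point now) { last_active_ = now; }

  // True once the connection has been idle for more than idle_timeout.
  bool should_tear_down(Clock::time_point now) const {
    return now - last_active_ > idle_timeout_;
  }

  // The same interval expressed in milliseconds (seconds * 1000).
  long long timeout_ms() const {
    return std::chrono::duration_cast<std::chrono::milliseconds>(idle_timeout_)
        .count();
  }

 private:
  std::chrono::seconds idle_timeout_;
  Clock::time_point last_active_;
};
```

Passing the current time in explicitly, instead of reading the clock inside the helper, keeps the check deterministic and easy to test.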

Various corner cases can cause an async connection to get stuck in the connecting stage (e.g., by manually creating some loopback connections on the switches of our test cluster, we can reproduce http://tracker.ceph.com/issues/37499 almost 100% of the time). In 61b9432 I tried to employ the existing keep_alive mechanism to get those stuck connections out of the trap, but that does not work if the corresponding connection is not yet ready, since we always require the underlying connection to be **ready** in order to send out a keep_alive message.

Fix by introducing a more general connecting timeout strategy: if a connecting process cannot be finished within a specific interval, we simply cut it off and retry.

Fixes: http://tracker.ceph.com/issues/37499
Fixes: http://tracker.ceph.com/issues/38493
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
(cherry picked from commit 7209cc6)

Conflicts:
src/common/legacy_config_opts.h
src/common/options.cc : Resolved for ms_connection_ready_timeout
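The connecting-timeout strategy described in this commit message can be sketched roughly as follows. This is a simplified model under stated assumptions, not the actual AsyncConnection code: `ConnSketch`, `ConnState`, `TickAction`, and `tick_connecting` are hypothetical names, and the timeout is passed in as a parameter standing in for the `ms_connection_ready_timeout` option.

```cpp
#include <chrono>

// Hypothetical, simplified model; not the Ceph AsyncConnection class.
enum class ConnState { Connecting, Ready };
enum class TickAction { None, RetryConnect };

struct ConnSketch {
  using Clock = std::chrono::steady_clock;
  ConnState state = ConnState::Connecting;
  Clock::time_point connect_started{};  // when the current attempt began
  int retries = 0;                      // attempts that were cut off
};

// Called periodically. If the connection has not become ready within
// ready_timeout, cut the attempt off and start a new one. A keep_alive
// cannot help at this point because it requires a ready connection.
TickAction tick_connecting(ConnSketch& c, ConnSketch::Clock::time_point now,
                           std::chrono::seconds ready_timeout) {
  if (c.state != ConnState::Connecting) {
    return TickAction::None;  // ready connections are not handled here
  }
  if (now - c.connect_started >= ready_timeout) {
    c.connect_started = now;  // restart the clock for the next attempt
    ++c.retries;
    return TickAction::RetryConnect;
  }
  return TickAction::None;
}
```

The point of the sketch is that the timer depends only on how long the connection has been in the connecting state, so it fires even when no message can be exchanged yet.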

Long-running clients connected to thrashing OSDs could result in a "see no progress in more than <timeout>" message printed to stderr. This is not an error but can result in test failures when console output is compared against expected output.

Fixes: http://tracker.ceph.com/issues/39448
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 6b4f972)

pdvian · 2019-06-05T05:24:57Z

@dillaman @xiexingguo Cherry-picked #28050. Kindly review.

dillaman

👍

yuriw · 2019-06-14T19:35:29Z

wip-yuri3-testing-2019-06-14-1440-nautilus

pdvian commented May 2, 2019

smithfarm requested a review from xiexingguo May 2, 2019 07:59

xiexingguo approved these changes May 5, 2019

sebastian-philipp added this to the nautilus milestone May 13, 2019

dillaman suggested changes May 15, 2019

smithfarm added the messenger, rbd, and DNM labels May 20, 2019

xiexingguo and others added 3 commits June 4, 2019 23:18

pdvian force-pushed the wip-39241-nautilus branch from d536379 to e52c645 Compare June 5, 2019 05:23

xiexingguo approved these changes Jun 5, 2019

dillaman approved these changes Jun 13, 2019

dillaman added the nautilus-batch-1 and needs-qa labels and removed the DNM label Jun 13, 2019

yuriw added the wip-yuri3-testing label Jun 14, 2019

yuriw merged commit 7282f59 into ceph:nautilus Jun 17, 2019
