-
Notifications
You must be signed in to change notification settings - Fork 6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
nautilus: msg/async: connection race + winner fault can leave connection stuck at replacing foreve #27915
Conversation
@@ -164,7 +164,7 @@ Pipe::Pipe(SimpleMessenger *r, int st, PipeConnection *con) | |||
|
|||
randomize_out_seq(); | |||
|
|||
msgr->timeout = msgr->cct->_conf->ms_tcp_read_timeout * 1000; //convert to ms | |||
msgr->timeout = msgr->cct->_conf->ms_connection_idle_timeout * 1000; //convert to ms |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xiexingguo Need review here. The Pipe constructor still referring to ms_tcp_read_timeout conf and has to be renamed to ms_connection_idle_timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please pull in PR #28050 as well to avoid breaking RBD tests
@dillaman @xiexingguo Sure. Let me work on it. |
The old naming is confusing, e.g., it actually indicates we should tear down the underlying connection which has no read/write activities at both sides (namely connection is idle) for over 15 minutes. Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn> (cherry picked from commit 1d46422) Conflicts: src/common/legacy_config_opts.h : Resolved for ms_connection_idle_timeout src/common/options.cc : Resolved for ms_connection_idle_timeout src/msg/simple/Pipe.cc : rename ms_tcp_read_timeout to ms_connection_idle_timeout
There could be various corner cases that may cause an async connection stuck in the connecting stage (e.g., by manually creating some loop back connections on the switches of our test cluster, we can almost 100% reproduce http://tracker.ceph.com/issues/37499). In 61b9432 I try to employ the existing keep_alive mechanism to get those stuck connections out of the trap but it does not work if the corresponding connection is not yet ready, since we always require the underlying connection to be **ready** in order to send out a keep_alive message. Fix by making a more general connecting timeout strategy. If a connecting process can not be finished within a specific interval, then we simply cut it off and retry. Fixes: http://tracker.ceph.com/issues/37499 Fixes: http://tracker.ceph.com/issues/38493 Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn> (cherry picked from commit 7209cc6) Conflicts: src/common/legacy_config_opts.h src/common/options.cc : Resolved for ms_connection_ready_timeout
Long running clients connected to thrashing OSDs could result in a "see no progress in more than <timeout>" message printed to stderr. This is not an error but can result in test failures when console output is compared against expected output. Fixes: http://tracker.ceph.com/issues/39448 Signed-off-by: Jason Dillaman <dillaman@redhat.com> (cherry picked from commit 6b4f972)
d536379
to
e52c645
Compare
@dillaman @xiexingguo Cherry-picked #28050. Kindly review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
http://tracker.ceph.com/issues/39241