Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nautilus: msg/async: connection race + winner fault can leave connection stuck at replacing foreve #27915

Merged
merged 3 commits into from
Jun 17, 2019

Conversation

pdvian
Copy link

@pdvian pdvian commented May 2, 2019

@@ -164,7 +164,7 @@ Pipe::Pipe(SimpleMessenger *r, int st, PipeConnection *con)

randomize_out_seq();

msgr->timeout = msgr->cct->_conf->ms_tcp_read_timeout * 1000; //convert to ms
msgr->timeout = msgr->cct->_conf->ms_connection_idle_timeout * 1000; //convert to ms
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xiexingguo Need review here. The Pipe constructor still referring to ms_tcp_read_timeout conf and has to be renamed to ms_connection_idle_timeout.

@smithfarm smithfarm requested a review from xiexingguo May 2, 2019 07:59
@sebastian-philipp sebastian-philipp added this to the nautilus milestone May 13, 2019
Copy link

@dillaman dillaman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please pull in PR #28050 as well to avoid breaking RBD tests

@smithfarm smithfarm added messenger Issues involving one of the Ceph messenger implementations rbd DNM labels May 20, 2019
@xiexingguo
Copy link
Member

@pdvian Can you cherry-pick in 6b4f972 too? Thanks!

@pdvian
Copy link
Author

pdvian commented Jun 5, 2019

@dillaman @xiexingguo Sure. Let me work on it.

xiexingguo and others added 3 commits June 4, 2019 23:18
The old naming is confusing, e.g., it actually indicates we should tear
down the underlying connection which has no read/write activities
at both sides (namely connection is idle) for over 15 minutes.

Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
(cherry picked from commit 1d46422)

Conflicts:
	src/common/legacy_config_opts.h : Resolved for ms_connection_idle_timeout
	src/common/options.cc : Resolved for ms_connection_idle_timeout
	src/msg/simple/Pipe.cc : rename ms_tcp_read_timeout to ms_connection_idle_timeout
There could be various corner cases that may cause an async
connection stuck in the connecting stage (e.g., by manually
creating some loop back connections on the switches of our test cluster,
we can almost 100% reproduce http://tracker.ceph.com/issues/37499).

In 61b9432 I try to employ the
existing keep_alive mechanism to get those stuck connections out of the
trap but it does not work if the corresponding connection
is not yet ready, since we always require the underlying connection to be
**ready** in order to send out a keep_alive message.

Fix by making a more general connecting timeout strategy.
If a connecting process can not be finished within a specific interval,
then we simply cut it off and retry.

Fixes: http://tracker.ceph.com/issues/37499
Fixes: http://tracker.ceph.com/issues/38493
Signed-off-by: xie xingguo <xie.xingguo@zte.com.cn>
(cherry picked from commit 7209cc6)

Conflicts:
	src/common/legacy_config_opts.h
	src/common/options.cc : Resolved for ms_connection_ready_timeout
Long running clients connected to thrashing OSDs could result in a
"see no progress in more than <timeout>" message printed to stderr.
This is not an error but can result in test failures when console
output is compared against expected output.

Fixes: http://tracker.ceph.com/issues/39448
Signed-off-by: Jason Dillaman <dillaman@redhat.com>
(cherry picked from commit 6b4f972)
@pdvian
Copy link
Author

pdvian commented Jun 5, 2019

@dillaman @xiexingguo Cherry-picked #28050. Kindly review.

Copy link

@dillaman dillaman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@dillaman dillaman added nautilus-batch-1 nautilus point releases needs-qa and removed DNM labels Jun 13, 2019
@yuriw
Copy link
Contributor

yuriw commented Jun 14, 2019

@yuriw yuriw merged commit 7282f59 into ceph:nautilus Jun 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
messenger Issues involving one of the Ceph messenger implementations nautilus-batch-1 nautilus point releases needs-qa rbd wip-yuri3-testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants