mgr: die on bind() failure #20595

Merged
merged 1 commit into from Feb 28, 2018
jcsp
Contributor

@jcsp jcsp commented Feb 26, 2018

Previously, the daemon would get wedged if it
competed for the same port as another daemon
on the same host and lost.

Fixes: https://tracker.ceph.com/issues/23037
Signed-off-by: John Spray <john.spray@redhat.com>
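
The behaviour described above (exit when the bind is lost rather than carry on in a wedged state) can be illustrated with a minimal sketch. This is not the code from the commit: `bind_loopback`, `bound_port`, and `bind_or_die` are hypothetical helpers written against plain POSIX sockets, and the loopback address is an assumption made only to keep the example self-contained.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cerrno>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Hypothetical helper (not Ceph code): open a TCP socket and bind it to
// 127.0.0.1:port. Returns the file descriptor, or -errno on failure.
int bind_loopback(uint16_t port) {
  int fd = ::socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return -errno;

  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = htons(port);

  if (::bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
    int err = errno;  // save before close() can overwrite it
    ::close(fd);
    return -err;
  }
  return fd;
}

// Hypothetical helper: report the port the kernel assigned to fd
// (useful after binding to port 0). Returns 0 on error.
uint16_t bound_port(int fd) {
  sockaddr_in addr{};
  socklen_t len = sizeof(addr);
  if (::getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len) < 0) return 0;
  return ntohs(addr.sin_port);
}

// Hypothetical startup step: if the bind fails, log and exit with a
// non-zero status so a supervisor can notice, instead of continuing
// without a usable socket.
int bind_or_die(uint16_t port) {
  int fd = bind_loopback(port);
  if (fd < 0) {
    std::fprintf(stderr, "unable to bind port %u: %s\n",
                 static_cast<unsigned>(port), std::strerror(-fd));
    std::exit(EXIT_FAILURE);
  }
  return fd;
}
```

A second process (or a second socket in the same process) that calls `bind_or_die` on a port already held by another socket would terminate with a non-zero exit status rather than hang.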

@wjwithagen
Contributor

Why not do the same as with binding in MON and OSD?
Now systems that do not have systemd will have to start wrapping mgr with something to keep autostarting it. :(

@jcsp
Contributor Author

jcsp commented Feb 27, 2018

@wjwithagen I thought the mon and osd were also exiting when they had a bind failure?

@wjwithagen
Contributor

@jcsp

I know there is, because FreeBSD and Linux do not see eye to eye on socket reuse and the options used for that. So there is some code in
./src/msg/async/AsyncMessenger.cc:96
That part loops on bind() problems and tries several ports several times before giving up.

Otherwise restarting MON and OSD would fail because the socket is still in use.
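
The retry behaviour described here (try several ports, several times, then give up) can be sketched roughly as follows. This is an illustration only, not the code in AsyncMessenger.cc: `try_bind`, `local_port`, and `bind_with_retry` are hypothetical names, and the parameters `port_min`, `port_max`, `retry_count`, and `delay` are assumptions standing in for whatever the messenger actually reads from its configuration.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cerrno>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical helper: bind a TCP socket to 127.0.0.1:port.
// Returns the file descriptor, or -errno on failure.
int try_bind(uint16_t port) {
  int fd = ::socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return -errno;

  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = htons(port);

  if (::bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0) {
    int err = errno;  // save before close() can overwrite it
    ::close(fd);
    return -err;
  }
  return fd;
}

// Hypothetical helper: port currently bound to fd, or 0 on error.
uint16_t local_port(int fd) {
  sockaddr_in addr{};
  socklen_t len = sizeof(addr);
  if (::getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len) < 0) return 0;
  return ntohs(addr.sin_port);
}

// Walk the range [port_min, port_max] up to retry_count times, sleeping
// for `delay` between rounds. Returns a bound fd, or -errno from the
// last failed attempt once every round has been exhausted.
int bind_with_retry(uint16_t port_min, uint16_t port_max, int retry_count,
                    std::chrono::milliseconds delay) {
  int last_err = EADDRINUSE;
  for (int attempt = 0; attempt < retry_count; ++attempt) {
    if (attempt > 0) std::this_thread::sleep_for(delay);
    // unsigned counter avoids wrap-around when port_max is 65535
    for (unsigned port = port_min; port <= port_max; ++port) {
      int fd = try_bind(static_cast<uint16_t>(port));
      if (fd >= 0) return fd;
      last_err = -fd;
    }
  }
  return -last_err;  // caller decides what to do, e.g. abort
}
```

With a single-port range, the loop can only fail if that one port stays occupied for every round; with a wider range, every port in it has to be taken.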

@jcsp
Contributor Author

jcsp commented Feb 27, 2018

Since ceph-mgr is using the same underlying messenger code, it would get the same ms_bind_retry_count though, so I still don't see what the difference is between what mgr is doing and what mon/osd are doing?

@wjwithagen
Contributor

@jcsp
Neither do I then. Both Simple and Async play this trick.

So this is more or less only possible when there is only one MGR port that is retried several times, after which the retrying gives up.
Or there are more ports and all of those were taken.
So it feels more like a configuration problem?

Not to mention that this PR ends up doing what MON and OSD do in the end: abort and die.

@tchaikov tchaikov merged commit 55b4662 into ceph:master Feb 28, 2018