quincy: msg/async: don't abort when public addrs mismatch bind addrs #50575
@rzarzynski I found some related failures. Here is one example: Several more jobs that also have a similar symptom are: And more on the link.
It looks like the reef suite branch has been used to verify the quincy backport (see the PR for
Before 69b47c8 (PR ceph#50153), a mismatch (in number or types of stored `entity_addr_t`) between the public addrs and bind addrs vectors was ignored and the former took precedence over everything else -- it was possible to e.g. bind to both v1 and v2 addresses but expose v2 only. Unfortunately, that's exactly how Rook configures ceph-mon:

```
debug 2023-03-16T21:01:48.389+0000 7f99822bf8c0 0 starting mon.a rank 0 at public addrs v2:172.30.122.144:3300/0 at bind addrs [v2:10.129.2.21:3300/0,v1:10.129.2.21:6789/0] mon_data /var/lib/ceph/mon/ceph-a fsid acc14d1b-fb2b-4f01-8b61-6e7cb26e9200
```

The consequence is the following abort:

```
ceph version 17.2.5-1338.el9cp (5adce3015143c7c2cc135a71368be194744f5761) quincy (stable)
1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xd3) [0x7ff05392c1a6]
2: /usr/lib64/ceph/libceph-common.so.2(+0x165ac1) [0x7ff05394eac1]
3: (AsyncMessenger::bindv(entity_addrvec_t const&, std::optional<entity_addrvec_t>)+0x1fe) [0x7ff053baa0ce]
4: main()
5: /lib64/libc.so.6(+0x3feb0) [0x7ff053048eb0]
6: __libc_start_main()
7: _start()
debug *** Caught signal (Aborted) **
 in thread 7ff052ab18c0 thread_name:ceph-mon
2023-03-16T09:56:35.995+0000 7ff052ab18c0 -1 /builddir/build/BUILD/ceph-17.2.5-1338-g484e8dbb/src/msg/msg_types.h: In function 'void entity_addr_t::set_port(int)' thread 7ff052ab18c0 time 2023-03-16T09:56:35.996339+0000
/builddir/build/BUILD/ceph-17.2.5-1338-g484e8dbb/src/msg/msg_types.h: 359: ceph_abort_msg("abort() called")
```

This commit brings the original logic back, but in a way that preserves the port numbers figured out by, e.g., `Processor::bind`.

Fixes: https://tracker.ceph.com/issues/59100
Signed-off-by: Radosław Zarzyński <rzarzyns@redhat.com>
(cherry picked from commit 68fbdcf)
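For readers unfamiliar with the messenger code, the behaviour described in the commit message can be illustrated with a small standalone sketch. This is not the Ceph implementation: `Addr`, `AddrVec` and `addrs_to_advertise` are hypothetical stand-ins for `entity_addr_t`, `entity_addrvec_t` and the relevant part of `AsyncMessenger::bindv`, and the rule "a public address with port 0 borrows the port of the bound address of the same type" is only one assumed reading of "preserves the port numbers".

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for Ceph's address types.
enum class AddrType { V1, V2 };

struct Addr {
  AddrType type;
  std::string ip;
  std::uint16_t port;  // 0 is treated here as "port not decided yet"
};

using AddrVec = std::vector<Addr>;

// Decide which addresses to expose to peers.
// - Without public addrs, expose whatever was bound.
// - With public addrs, they win even if their number or types differ from
//   the bound ones (no abort on mismatch).
// - Assumption: a public addr with port 0 takes the port that binding chose
//   for the bound addr of the same type, if there is one.
AddrVec addrs_to_advertise(const AddrVec& bound,
                           const std::optional<AddrVec>& public_addrs) {
  if (!public_addrs || public_addrs->empty()) {
    return bound;
  }
  AddrVec out = *public_addrs;
  for (auto& pub : out) {
    if (pub.port != 0) {
      continue;  // an explicitly configured port is left untouched
    }
    for (const auto& b : bound) {
      if (b.type == pub.type) {
        pub.port = b.port;  // keep the port figured out during bind
        break;
      }
    }
  }
  return out;
}
```

With the Rook-style setup from the log (bound to both v2 and v1, public addrs containing a single v2 entry), this sketch returns just the one v2 public address instead of failing on the size mismatch.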
9cbe8f0 to 3fb8104
Rados suite review: https://pulpito.ceph.com/?branch=wip-yuri3-testing-2023-03-22-1123-quincy Failures, unrelated: Details:
backport tracker: https://tracker.ceph.com/issues/59101
backport of #50574
parent tracker: https://tracker.ceph.com/issues/59100
this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/main/src/script/ceph-backport.sh