quincy: msg/async: don't abort when public addrs mismatch bind addrs #50575

Merged
merged 1 commit into from Mar 23, 2023


rzarzynski

backport tracker: https://tracker.ceph.com/issues/59101


backport of #50574
parent tracker: https://tracker.ceph.com/issues/59100

this backport was staged using ceph-backport.sh version 16.0.0.6848
find the latest version at https://github.com/ceph/ceph/blob/main/src/script/ceph-backport.sh

@ljflores

ljflores commented Mar 21, 2023

@rzarzynski I found some related failures. Here is one example:

https://pulpito.ceph.com/yuriw-2023-03-18_00:57:11-rados-wip-yuri3-testing-2023-03-17-1235-quincy-distro-default-smithi/

/a/yuriw-2023-03-18_00:57:11-rados-wip-yuri3-testing-2023-03-17-1235-quincy-distro-default-smithi/7213295/teuthology.log

2023-03-21T15:30:19.827 DEBUG:teuthology.orchestra.run.smithi089:> adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage ceph --cluster ceph osd dump --format=json
2023-03-21T15:30:19.916 INFO:tasks.ceph.osd.1.smithi089.stderr:2023-03-21T15:30:19.915+0000 7f5dde0ee540 -1 Falling back to public interface
2023-03-21T15:30:19.919 INFO:tasks.ceph.osd.2.smithi089.stderr:2023-03-21T15:30:19.918+0000 7f9411748540 -1 Falling back to public interface
2023-03-21T15:30:19.924 INFO:tasks.ceph.osd.0.smithi089.stderr:2023-03-21T15:30:19.923+0000 7fa7f4f67540 -1 Falling back to public interface

/a/yuriw-2023-03-18_00:57:11-rados-wip-yuri3-testing-2023-03-17-1235-quincy-distro-default-smithi/7213295/remote/smithi089/log/ceph-osd.1.log.gz

2023-03-21T15:30:19.914+0000 7f5dde0ee540 10 bluestore(/var/lib/ceph/osd/ceph-1) get_numa_node bdev nvme0n1 on numa_node 0
2023-03-21T15:30:19.914+0000 7f5dde0ee540  1  objectstore numa_node 0
2023-03-21T15:30:19.914+0000 7f5dde0ee540  0 starting osd.1 osd_data /var/lib/ceph/osd/ceph-1 /var/lib/ceph/osd/ceph-1/journal
2023-03-21T15:30:19.915+0000 7f5dde0ee540 10 picked public_addrs [v2:0.0.0.0:0/0,v1:0.0.0.0:0/0]
2023-03-21T15:30:19.915+0000 7f5dde0ee540 10 there is no public_bind_addrs, defaulting to public_addrs
2023-03-21T15:30:19.915+0000 7f5dde0ee540 -1 Falling back to public interface
2023-03-21T15:30:19.915+0000 7f5dde0ee540  1 -- [v2:0.0.0.0:6800/109346,v1:0.0.0.0:6801/109346] _finish_bind bind my_addrs is [v2:0.0.0.0:6800/109346,v1:0.0.0.0:6801/109346]
2023-03-21T15:30:19.915+0000 7f5dde0ee540  1 -- [v2:0.0.0.0:6802/109346,v1:0.0.0.0:6803/109346] _finish_bind bind my_addrs is [v2:0.0.0.0:6802/109346,v1:0.0.0.0:6803/109346]
2023-03-21T15:30:19.915+0000 7f5dde0ee540  1 -- [v2:0.0.0.0:6804/109346,v1:0.0.0.0:6805/109346] _finish_bind bind my_addrs is [v2:0.0.0.0:6804/109346,v1:0.0.0.0:6805/109346]
2023-03-21T15:30:19.915+0000 7f5dde0ee540  1 -- [v2:0.0.0.0:6806/109346,v1:0.0.0.0:6807/109346] _finish_bind bind my_addrs is [v2:0.0.0.0:6806/109346,v1:0.0.0.0:6807/109346]
2023-03-21T15:30:19.922+0000 7f5dde0ee540  0 load: jerasure load: lrc
2023-03-21T15:30:19.922+0000 7f5dde0ee540  1 bdev(0x5555c3742000 /var/lib/ceph/osd/ceph-1/block) open path /var/lib/ceph/osd/ceph-1/block

Several more jobs show a similar symptom:
/a/yuriw-2023-03-18_00:57:11-rados-wip-yuri3-testing-2023-03-17-1235-quincy-distro-default-smithi/7213012
/a/yuriw-2023-03-18_00:57:11-rados-wip-yuri3-testing-2023-03-17-1235-quincy-distro-default-smithi/7213015
/a/yuriw-2023-03-18_00:57:11-rados-wip-yuri3-testing-2023-03-17-1235-quincy-distro-default-smithi/7213046
/a/yuriw-2023-03-18_00:57:11-rados-wip-yuri3-testing-2023-03-17-1235-quincy-distro-default-smithi/7213072

More are listed at the pulpito link above.

@rzarzynski

It looks like the reef suite branch was used to verify the quincy backport (see the PR for main for details). As we already need to retest, I'm pushing a new commit whose sole difference is the removal of the assert (for absolute clarity: no other changes; just this one).

Before 69b47c8 (PR ceph#50153),
a mismatch (in the number or types of stored `entity_addr_t`) between
the public addrs and bind addrs vectors was ignored, and the former
took precedence over everything else -- it was possible to e.g. bind to
both v1 and v2 addresses but expose v2 only. Unfortunately, that's
exactly how Rook configures ceph-mon:

```
debug 2023-03-16T21:01:48.389+0000 7f99822bf8c0  0 starting mon.a rank 0 at public addrs v2:172.30.122.144:3300/0 at bind addrs [v2:10.129.2.21:3300/0,v1:10.129.2.21:6789/0] mon_data /var/lib/ceph/mon/ceph-a fsid acc14d1b-fb2b-4f01-8b61-6e7cb26e9200
```

The consequence is the following abort:

```
ceph version 17.2.5-1338.el9cp (5adce3015143c7c2cc135a71368be194744f5761) quincy (stable)
1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)+0xd3) [0x7ff05392c1a6]
2: /usr/lib64/ceph/libceph-common.so.2(+0x165ac1) [0x7ff05394eac1]
3: (AsyncMessenger::bindv(entity_addrvec_t const&, std::optional<entity_addrvec_t>)+0x1fe) [0x7ff053baa0ce]
4: main()
5: /lib64/libc.so.6(+0x3feb0) [0x7ff053048eb0]
6: __libc_start_main()
7: _start()
debug *** Caught signal (Aborted) **  in thread 7ff052ab18c0 thread_name:ceph-mon 2023-03-16T09:56:35.995+0000 7ff052ab18c0 -1 /builddir/build/BUILD/ceph-17.2.5-1338-g484e8dbb/src/msg/msg_types.h: In function 'void entity_addr_t::set_port(int)' thread 7ff052ab18c0 time 2023-03-16T09:56:35.996339+0000 /builddir/build/BUILD/ceph-17.2.5-1338-g484e8dbb/src/msg/msg_types.h: 359: ceph_abort_msg("abort() called")
```

This commit brings the original logic back, but in a way that
preserves the port numbers figured out by, e.g., `Processor::bind`.

Fixes: https://tracker.ceph.com/issues/59100
Signed-off-by: Radosław Zarzyński <rzarzyns@redhat.com>
(cherry picked from commit 68fbdcf)
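
For readers who want the idea in isolation, here is a minimal sketch of the behaviour the commit message describes: the public addrs vector decides what is advertised (in count and types), while ports that were only resolved at bind time are carried over. This is not the code from the commit. `Addr`, `AddrType`, `advertised_addrs` and `rook_like_example` are hypothetical stand-ins invented for illustration, and matching a public entry to a bound entry by address type is an assumption; the real logic operates on `entity_addr_t` / `entity_addrvec_t` inside `AsyncMessenger::bindv`.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for entity_addr_t (illustration only).
enum class AddrType { V1, V2 };

struct Addr {
  AddrType type;
  std::string ip;
  std::uint16_t port;  // 0 means "no port chosen yet"
};

// Sketch: the public vector takes precedence, so a mismatch in count or
// types against the bound vector is tolerated instead of aborting.
// Assumption: a public entry with port 0 inherits the port of the first
// bound address of the same type, if there is one.
std::vector<Addr> advertised_addrs(const std::vector<Addr>& public_addrs,
                                   const std::vector<Addr>& bound_addrs) {
  std::vector<Addr> out = public_addrs;
  for (auto& pub : out) {
    if (pub.port != 0) {
      continue;  // an explicitly configured public port is kept as-is
    }
    for (const auto& bound : bound_addrs) {
      if (bound.type == pub.type) {
        pub.port = bound.port;  // preserve the port resolved at bind time
        break;
      }
    }
  }
  return out;
}

// Usage with the values from the Rook mon log line above:
// bound to both v2 and v1, but only the v2 public address is exposed.
inline void rook_like_example() {
  std::vector<Addr> pub = {Addr{AddrType::V2, "172.30.122.144", 3300}};
  std::vector<Addr> bound = {Addr{AddrType::V2, "10.129.2.21", 3300},
                             Addr{AddrType::V1, "10.129.2.21", 6789}};
  std::vector<Addr> adv = advertised_addrs(pub, bound);
  assert(adv.size() == 1);
  assert(adv[0].ip == "172.30.122.144");
  assert(adv[0].port == 3300);
}
```

The sketch copies the public vector and only fills in missing ports, so nothing depends on the two vectors having the same length or the same set of address types.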
@ljflores

Rados suite review: https://pulpito.ceph.com/?branch=wip-yuri3-testing-2023-03-22-1123-quincy

Failures, unrelated:
1. https://tracker.ceph.com/issues/58585
2. https://tracker.ceph.com/issues/56000
3. https://tracker.ceph.com/issues/58560
4. https://tracker.ceph.com/issues/58475
5. https://tracker.ceph.com/issues/59080

Details:
1. rook: failed to pull kubelet image - Ceph - Orchestrator
2. task/test_nfs: ERROR: Daemon not found: mds.a.smithi060.ujwxef. See cephadm ls - Ceph - Orchestrator
3. test_envlibrados_for_rocksdb.sh failed to subscribe to repo - Infrastructure
4. test_dashboard_e2e.sh: Conflicting peer dependency: postcss@8.4.21 - Ceph - Mgr - Dashboard
5. mclock-config.sh: TEST_profile_disallow_builtin_params_modify fails when $res == $opt_val_new - Ceph - RADOS

@ljflores ljflores merged commit dec366b into ceph:quincy Mar 23, 2023
10 checks passed