msg/async/rdma: compile with rdma as default #13901

Merged
merged 1 commit into from Mar 9, 2017

Adirl commented Mar 9, 2017

Signed-off-by: DanielBar-On danielbo@mellanox.com

msg/async/rdma: compile with rdma as default
Issue: 992583
Issue: 992580

Change-Id: I24128b87294d3083c44c934f7d4bed554dd1f8a4
Signed-off-by: DanielBar-On <danielbo@mellanox.com>
Adirl commented Mar 9, 2017

@liewegas
Following our discussion, see this PR for compiling Ceph with RDMA by default.
Verified on native CentOS 7.2 and Ubuntu 14.04.
install-deps.sh will use apt-get/yum to install libibverbs as needed.

Thanks
Adir

Adirl commented Mar 9, 2017

liewegas added the build/ops label Mar 9, 2017

Member

yuyuyu101 commented Mar 9, 2017

It's exciting to see RDMA built in.

liewegas merged commit f5dfa07 into ceph:master Mar 9, 2017

3 checks passed

Signed-off-by all commits in this PR are signed
Unmodifed Submodules submodules for project are unmodified
default Build finished.
Adirl commented Mar 9, 2017

Thank you

hwchiu commented Mar 10, 2017

Hi, do we need to change the systemd config for RDMA, as in #13305, after this change?

Thanks.

Adirl commented Mar 11, 2017

If you want to use systemctl or ceph-deploy to bring up and control your cluster, then the answer is yes.
#13305 is not merged yet, so you will have to apply it manually. We are working on changes to get it merged into the main branch.

hwchiu commented Mar 14, 2017

Hi, I tried to use this version of Ceph with RDMA and encountered some problems.

After I set up ceph.conf and installed Ceph, built-in commands such as ceph and rados crash.

The following are my notes; I hope they are useful.

How I installed Ceph

  1. ceph-deploy install --dev=wip-rdma-build-test --dev-commit=6f5e6e984e19e5ad33552fbb65033c91dbb7a36c ceph-1
  2. the content of ceph.conf is here

What problems I met

  1. Segmentation fault when I execute the ceph script
  2. Not only ceph: rados lspools crashes as well.

What I did about those problems
I used gdb to find the problem, and it shows that something is wrong at

void Device::binding_port(CephContext *cct, uint8_t port_num) {
--->	port_cnt = device_attr->phys_port_cnt;
	for (uint8_t i = 0; i < port_cnt; ++i) {
		Port *port = new Port(cct, ctxt, i+1);
...

You can see the full gdb content here.
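
The thread does not establish why the marked line faulted. One possibility, stated here purely as an assumption, is that the attribute pointer was never populated because no matching device was opened. A minimal sketch of a guarded version of the same loop follows; `DeviceAttr` and `ports_to_bind` are hypothetical names for illustration, not Ceph's code.

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the attribute struct in the snippet above;
// only the one field used here is modeled.
struct DeviceAttr {
  uint8_t phys_port_cnt;
};

// Returns the 1-based port numbers to bind, and refuses to dereference
// a null attribute pointer instead of crashing.
std::vector<int> ports_to_bind(const DeviceAttr* attr) {
  if (attr == nullptr) {
    throw std::invalid_argument("device attributes unavailable");
  }
  std::vector<int> ports;
  for (int i = 0; i < attr->phys_port_cnt; ++i) {
    ports.push_back(i + 1);  // port numbering starts at 1, as in the snippet (i+1)
  }
  return ports;
}
```

With a guard like this, a misconfigured device name would surface as a reportable error rather than a segmentation fault.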

Thanks.

Member

yuyuyu101 commented Mar 14, 2017

@Adirl could we deploy RDMA via ceph-deploy? I guess this may be related to device privileges?

Adirl commented Mar 14, 2017

Instead of ens4, try putting the RDMA device name in ceph.conf,
for instance: ms_async_rdma_device_name=mlx5_0

Here is an example of how to find it for ConnectX-4:
$ cat /sys/class/net/ens4/device/infiniband/mlx5_0/ports/*/state
4: ACTIVE
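
The sysfs path in the command above suggests that the directory name under `/sys/class/net/<interface>/device/infiniband/` is the value to use for ms_async_rdma_device_name. A minimal C++17 sketch of that lookup follows, assuming this sysfs layout holds on the host; the function name `rdma_devices_for_netdev` and the `sysfs_root` parameter are hypothetical and exist only for illustration and testing.

```cpp
#include <algorithm>
#include <filesystem>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Returns the RDMA device names (e.g. "mlx5_0") that sysfs lists for a
// network interface. `sysfs_root` is a parameter only so the function can
// be run against a fake tree; on a real host it would be "/sys".
std::vector<std::string> rdma_devices_for_netdev(
    const std::string& netdev, const fs::path& sysfs_root = "/sys") {
  std::vector<std::string> names;
  const fs::path dir =
      sysfs_root / "class" / "net" / netdev / "device" / "infiniband";
  std::error_code ec;
  if (!fs::is_directory(dir, ec)) {
    return names;  // interface missing, or no RDMA device behind it
  }
  for (const auto& entry : fs::directory_iterator(dir, ec)) {
    names.push_back(entry.path().filename().string());
  }
  std::sort(names.begin(), names.end());
  return names;
}
```

Calling `rdma_devices_for_netdev("ens4")` on the host from the example would be expected to return the name shown in the path above, provided the layout assumption holds.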

Adirl commented Mar 14, 2017

If you have OFED installed you can run: ibdev2netdev

hwchiu commented Mar 14, 2017

Thanks for your help. I will try replacing the device name.

hwchiu commented Mar 15, 2017

Hi,

Thanks for your help. I still have some problems.

After changing ms_async_rdma_device_name to mlx4_0, I can successfully run the ceph command, but I hit two more crashes with these two commands:
1. rados lspools
2. ceph rados osd tree

rados lspools

The rados lspools command crashes after it prints its result.
For example:

hwchiu@ceph-1:~/cluster$ sudo rados lspools
rbd
Segmentation fault (core dumped)

The gdb output is here (https://gist.github.com/hwchiu/eadc75c6582588db3a4a8f1faf70f70a); it crashes after librados::RadosClient::connect.

ceph rados osd tree

With the ceph rados osd tree command, the process aborts on a failed assertion.

/build/ceph-12.0.0-1287-g6f5e6e98/src/msg/async/rdma/RDMAConnectedSocketImpl.cc: In function 'void RDMAConnectedSocketImpl::handle_connection()' thread 7ff4395eb700 time 2017-03-15 15:24:33.581693
/build/ceph-12.0.0-1287-g6f5e6e98/src/msg/async/rdma/RDMAConnectedSocketImpl.cc: 218: FAILED assert(!r)
2017-03-15 15:24:33.581678 7ff4395eb700 -1  RDMAConnectedSocketImpl activate failed to transition to RTR state: (22) Invalid argument
2017-03-15 15:24:33.581863 7ff438dea700 -1  RDMAConnectedSocketImpl activate failed to transition to RTR state: (22) Invalid argument
/build/ceph-12.0.0-1287-g6f5e6e98/src/msg/async/rdma/RDMAConnectedSocketImpl.cc: In function 'void RDMAConnectedSocketImpl::handle_connection()' thread 7ff438dea700 time 2017-03-15 15:24:33.581878
/build/ceph-12.0.0-1287-g6f5e6e98/src/msg/async/rdma/RDMAConnectedSocketImpl.cc: 218: FAILED assert(!r)
 ceph version 12.0.0-1287-g6f5e6e98 (6f5e6e984e19e5ad33552fbb65033c91dbb7a36c)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x7ff454a3bcb2]
 2: (RDMAConnectedSocketImpl::handle_connection()+0xb4a) [0x7ff454b8536a]
 3: (EventCenter::process_events(int)+0x9b1) [0x7ff454b6d5d1]
 4: (()+0x3ba091) [0x7ff454b72091]
 5: (()+0xb8c80) [0x7ff4542e6c80]
 6: (()+0x76ba) [0x7ff4663a16ba]
 7: (clone()+0x6d) [0x7ff4660d782d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Aborted (core dumped)
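
For context on the "failed to transition to RTR state" message: an InfiniBand queue pair is normally brought up by stepping through the states RESET, INIT, RTR (ready to receive) and RTS (ready to send), in that order. The sketch below only models that ordering; the enum and function names are hypothetical and this is not Ceph's or libibverbs' code.

```cpp
// Simplified model of the queue pair states used during connection setup.
enum class QpState { Reset, Init, Rtr, Rts };

// True only for the forward steps used when bringing a connection up:
// RESET -> INIT -> RTR -> RTS.
bool is_setup_step(QpState from, QpState to) {
  switch (from) {
    case QpState::Reset: return to == QpState::Init;
    case QpState::Init:  return to == QpState::Rtr;
    case QpState::Rtr:   return to == QpState::Rts;
    case QpState::Rts:   return false;  // already fully set up
  }
  return false;
}
```

The INIT to RTR step is where the remote peer's parameters are supplied, so an "Invalid argument" error there points at the attributes passed for that step rather than at the ordering itself.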

I tried installing different OFED versions, but both MLNX_OFED_LINUX-3.4-2.0.0.0 and MLNX_OFED_LINUX-4.0-1.0.1.0 still show the problem.

The ceph.conf is the same as the one I posted before; only ms_async_rdma_device_name changed, to mlx4_0.

Thanks again for your help.

hwchiu commented Mar 15, 2017

Please ignore the above message.
I have fixed my problem; it was caused by linking against the wrong librados library.

Thanks.
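
One way to spot this kind of mismatch is to check which librados file a process has actually mapped. The following is a minimal sketch, assuming a Linux host where `/proc/<pid>/maps` is available; the function name `mapped_libraries` is hypothetical, and paths containing spaces are not handled.

```cpp
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Scans text in /proc/<pid>/maps format and returns the distinct file
// paths that contain `needle` (for example "librados").
std::set<std::string> mapped_libraries(std::istream& maps,
                                       const std::string& needle) {
  std::set<std::string> paths;
  std::string line;
  while (std::getline(maps, line)) {
    // The pathname, when present, is the sixth whitespace-separated field.
    std::istringstream fields(line);
    std::string range, perms, offset, dev, inode, path;
    if (!(fields >> range >> perms >> offset >> dev >> inode >> path)) {
      continue;  // anonymous mapping with no pathname
    }
    if (path.find(needle) != std::string::npos) {
      paths.insert(path);
    }
  }
  return paths;
}
```

Feeding it a `std::ifstream` opened on `/proc/self/maps` (or on the maps file of a running rados process) would list the librados path in use, which can then be compared with the one installed by the package under test.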

Adirl deleted the Adirl:default branch Apr 18, 2017
