Already on GitHub? Sign in to your account

msg/async/rdma: cleanup #13509

Merged
merged 3 commits into from Feb 27, 2017

4 participants
@yuyuyu101
Member

yuyuyu101 commented Feb 18, 2017

Currently, on the OSD side, the async messenger copies the rx buffer and then returns it to the kernel. On tx, it also needs to copy the tx buffer into the registered tx buffers.

This commit makes the rx buffer copy go directly into the registered tx buffer, so that when sending a message it can avoid the extra tx buffer copy.

This mainly solves the OSD-side multiple tx buffer copy problem.

Signed-off-by: Haomai Wang haomai@xsky.com

@Adirl

Adirl commented Feb 19, 2017

reviewing

@Adirl

Adirl commented Feb 23, 2017

@yuyuyu101

Member

yuyuyu101 commented Feb 23, 2017

Is this related to this PR? @Adirl

@Adirl

Adirl commented Feb 23, 2017

@yuyuyu101
Probably a setup issue, sorry.
Updates to come.

@Adirl

Adirl commented Feb 26, 2017

Hi @yuyuyu101
I'm continuing my debugging; here's a quick update on my status:
The cluster is up with 3 OSDs.
Running:
[root@r-dcs69 ceph]# rados bench -p rbd 1 write --no-cleanup 1 16 16 0 0 0 - 0
No objects are written:

[root@r-dcs69 ceph]# ceph  df
GLOBAL:
    SIZE      AVAIL     RAW USED     %RAW USED
    1166G     1151G       15461M          1.29
POOLS:
    NAME     ID     USED     %USED     MAX AVAIL     OBJECTS
    rbd      0         0         0          107G           0

log shows:

[root@r-dcs69 ceph]# cat /var/log/ceph/ceph-osd.0.log  | grep get_registered_memory
2017-02-26 09:55:58.811539 7fe7154e8700 20 RDMAStack get_registered_memory need 4194304, got 4194304(512 chunks)
2017-02-26 09:55:58.832965 7fe7154e8700 20 RDMAStack get_registered_memory need 4194304, got 0(0 chunks)
2017-02-26 09:56:23.758501 7fe7154e8700 20 RDMAStack get_registered_memory need 4194304, got 0(0 chunks)
2017-02-26 09:56:33.761281 7fe7154e8700 20 RDMAStack get_registered_memory need 4194304, got 0(0 chunks)

counters show (great counters !!!):

[root@r-dcs69 ceph]#  ceph daemon osd.0 perf dump |grep register
        "rx_no_registered_mem": 0,
        "rx_no_registered_mem": 3,
        "rx_no_registered_mem": 0,
@Adirl

Adirl commented Feb 26, 2017

adding more info:

[root@r-dcs69 ceph]#  ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep rdma
    "xio_transport_type": "rdma",
    "ms_type": "async+rdma",
    "ms_public_type": "async+rdma",
    "ms_cluster_type": "async+rdma",
    "ms_async_rdma_device_name": "mlx5_0",
    "ms_async_rdma_enable_hugepage": "false",
    "ms_async_rdma_buffer_size": "8192",
    "ms_async_rdma_send_buffers": "512",
    "ms_async_rdma_receive_buffers": "512",
    "ms_async_rdma_port_num": "1",
    "ms_async_rdma_polling_us": "1000",
    "ms_async_rdma_local_gid": "",
    "ms_async_rdma_roce_ver": "1",
    "ms_async_rdma_sl": "3",

since we have
"ms_async_rdma_send_buffers": "512",
"ms_async_rdma_receive_buffers": "512",

Looks like the pool is exhausted on the first object (it never gets written, so the buffers are never released).

@Adirl

Adirl commented Feb 26, 2017

The same happens when setting:
"ms_async_rdma_send_buffers": "2048",
"ms_async_rdma_receive_buffers": "2048",

[root@r-dcs69 ceph]# cat /var/log/ceph/ceph-osd.0.log  | grep get_registered_memory
2017-02-26 11:51:39.295010 7f80e1e5f700 20 RDMAStack get_registered_memory need 4194304, got 4194304(512 chunks)
2017-02-26 11:51:39.311608 7f80e1e5f700 20 RDMAStack get_registered_memory need 4194304, got 4194304(512 chunks)
2017-02-26 11:51:39.319628 7f80e1e5f700 20 RDMAStack get_registered_memory need 4194304, got 4194304(512 chunks)
2017-02-26 11:52:04.706475 7f80e2660700 20 RDMAStack get_registered_memory need 4194304, got 4194304(512 chunks)
2017-02-26 11:52:04.718618 7f80e2660700 20 RDMAStack get_registered_memory need 4194304, got 0(0 chunks)
2017-02-26 11:52:19.650520 7f80e2660700 20 RDMAStack get_registered_memory need 4194304, got 0(0 chunks)
2017-02-26 11:52:19.663830 7f80e2660700 20 RDMAStack get_registered_memory need 4194304, got 0(0 chunks)
2017-02-26 11:52:19.673134 7f80e2660700 20 RDMAStack get_registered_memory need 4194304, got 0(0 chunks)
2017-02-26 11:52:19.677549 7f80e2660700 20 RDMAStack get_registered_memory need 4194304, got 0(0 chunks)
2017-02-26 11:52:19.681843 7f80e2660700 20 RDMAStack get_registered_memory need 4194304, got 0(0 chunks)

counters:

r-dcs69 ceph]#  ceph daemon osd.0 perf dump |grep register
        "rx_no_registered_mem": 0,
        "rx_no_registered_mem": 6,
        "rx_no_registered_mem": 0,

allocated memory never gets released....

@Adirl

Adirl commented Feb 26, 2017

attaching osd log with debug_ms=25

ceph-osd.0.txt

@Adirl

Adirl commented Feb 26, 2017

As I understand it, RDMAConnectedSocketImpl::read() needs to post_chunk() back to the RDMA pool if replication to other OSDs is not needed, but I don't see that in the code.
So I think what happens today is that the registered buffers are not getting released.

Does that make sense?

@Adirl

Adirl commented Feb 26, 2017

this is also interesting:

   -3> 2017-02-26 16:50:31.452108 7f31d461e700 20  RDMAConnectedSocketImpl read notify_fd : 0 in 3855 r = -1
    -2> 2017-02-26 16:50:31.452111 7f31d461e700 25  RDMAConnectedSocketImpl read_buffers this iter read: 7865 bytes. offset: 0 ,bound: 0. Chunk:0x7f31e64a0980
    -1> 2017-02-26 16:50:31.452112 7f31d461e700 20 Infiniband post_chunkibv_post_srq_recv failed (12) Cannot allocate memory
     0> 2017-02-26 16:50:31.455354 7f31d461e700 -1 /mnt/dataf/adirl/src/msg/async/rdma/RDMAConnectedSocketImpl.cc: In function 'ssize_t RDMAConnectedSocketImpl::read_buffers(char*, size_t)' thread 7f31d461e700 time 2017-02-26 16:50:31.452123
/mnt/dataf/adirl/src/msg/async/rdma/RDMAConnectedSocketImpl.cc: 323: FAILED assert(infiniband->post_chunk(*c, cct) == 0)

 ceph version 12.0.0-679-ge250ada (e250ada7aaa58bfadbf24e53fc95db1e559dfbf8)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x110) [0x7f31dae9d670]
 2: (RDMAConnectedSocketImpl::read_buffers(char*, unsigned long)+0x6bd) [0x7f31db08a94d]
 3: (RDMAConnectedSocketImpl::read(char*, unsigned long)+0xbe) [0x7f31db08f20e]
 4: (AsyncConnection::read_bulk(char*, unsigned int)+0x4f) [0x7f31db0661ff]
 5: (AsyncConnection::read_until(unsigned int, char*)+0x3f9) [0x7f31db066e69]
 6: (AsyncConnection::process()+0x524) [0x7f31db07b804]
 7: (EventCenter::process_events(int)+0x319) [0x7f31daf297a9]
 8: (()+0xa2d27a) [0x7f31daf2c27a]
 9: (()+0xb5230) [0x7f31d76c9230]
 10: (()+0x7dc5) [0x7f31d7d3edc5]
 11: (clone()+0x6d) [0x7f31d6e30ced]
@yuyuyu101

Member

yuyuyu101 commented Feb 27, 2017

OK, let's split this PR first.

@yuyuyu101 yuyuyu101 changed the title from msg/async/rdma: make tx zerocopy via aware registered buffer to msg/async/rdma: cleanup Feb 27, 2017

msg/async/rdma: cleanup Infiniband, remove unused functions
Signed-off-by: Haomai Wang <haomai@xsky.com>
@yuyuyu101

Member

yuyuyu101 commented Feb 27, 2017

I've separated the zero-copy code from the cleanup code. The cleanup code now passes tests.

msg/async/rdma: cleanup
Signed-off-by: Haomai Wang <haomai@xsky.com>
@Adirl

Adirl commented Feb 27, 2017

got it, thanks

msg/async/rdma: accelerate tx/rx buffer ownership lookup
Signed-off-by: Haomai Wang <haomai@xsky.com>
assert(ret == 0);
}
for (unsigned i=0; i < buffers.size(); ++i) {
ret = infiniband->post_chunk(buffers[i]);

@Adirl

Adirl Feb 27, 2017

recall_chunk() used to check who owns the Chunk and then return it to the tx or rx pool.
But post_chunk() returns it only to the rx pool (srq). Is this OK?

@yuyuyu101

yuyuyu101 Feb 27, 2017

Member

Yes, here we can confirm this is an rx buffer.

@Adirl

Adirl commented Feb 27, 2017

Tested "msg/async/rdma: accelerate tx/rx buffer ownership lookup";
verified to work on a cluster.

@yuyuyu101 yuyuyu101 merged commit c0b1e6e into ceph:master Feb 27, 2017

2 of 3 checks passed

default: Build started, sha1 is merged.
Signed-off-by: all commits in this PR are signed.
Unmodified Submodules: submodules for the project are unmodified.

@yuyuyu101 yuyuyu101 deleted the yuyuyu101:wip-tx-zerocopy branch Feb 27, 2017

@ue90

ue90 commented Mar 8, 2017

Is there some relation between the buffers in pending_bl and the buffers registered with ibv_reg_mr?
I'm confused by "infiniband->is_tx_buffer(it->raw_c_str())" in the submit function.
Thanks!

ssize_t RDMAConnectedSocketImpl::submit(bool more)
{
  ...
  std::vector<Chunk*> tx_buffers;
  std::list<bufferptr>::const_iterator it = pending_bl.buffers().begin();
  std::list<bufferptr>::const_iterator copy_it = it;
  unsigned total = 0;
  unsigned need_reserve_bytes = 0;
  while (it != pending_bl.buffers().end()) {
    if (infiniband->is_tx_buffer(it->raw_c_str())) {
      ...
    }
    ...
  }
}

@yuyuyu101

Member

yuyuyu101 commented Mar 8, 2017

Eh, it's designed for tx zero copy: if the message's buffer comes from registered memory, we can avoid the memcpy by using it directly.

@ue90

ue90 commented Mar 8, 2017

Got it. However, under what conditions will the buffer of a sent message come from registered memory? I couldn't find it in the code.
thx!

@yuyuyu101

Member

yuyuyu101 commented Mar 8, 2017

@ue90 Still not; rx zerocopy isn't ready.

@frankshangfufei

frankshangfufei commented Mar 9, 2017

still not, tx zerocopy isn't ready.

@ue90

ue90 commented Mar 9, 2017

@yuyuyu101
thanks .
I have learned that a tx_cq was added and an fd for it was registered for polling, but I can't figure out why it was done this way. Does it improve performance?

@yuyuyu101

Member

yuyuyu101 commented Mar 9, 2017

@ue90 Because I separated the tx and rx events into two CQs, we need to poll both CQs when events are available.

@ue90

ue90 commented Mar 9, 2017

@yuyuyu101 Got it. What are the advantages of separating tx, rx event into two cq?

@yuyuyu101

Member

yuyuyu101 commented Mar 9, 2017

@ue90 Just for clarity, not for performance.

@ue90

ue90 commented Mar 9, 2017

@yuyuyu101 thanks !
