
msg/async/rdma: refactor tx handle flow to get rid of locks #13680

Merged
merged 5 commits into ceph:master from yuyuyu101:wip-tx-zerocopy Mar 2, 2017

Conversation

@yuyuyu101
Member

yuyuyu101 commented Feb 28, 2017

Previously the Dispatcher thread polled both rx and tx events, then dispatched these events to RDMAWorker and RDMAConnectedSocketImpl.

Tx event handling is actually a lightweight task, so it is now handled inline. Rx event dispatching works as before.

Another change is adding a tx cq so that tx and rx event polling are separated.

This also removes a lot of code.
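As a rough illustration, a condensed sketch of the resulting polling flow (pieced together from the diff excerpts reviewed below, not the PR's exact code; the 'done' flag and the rx dispatch helper name are assumptions):

void RDMADispatcher::polling()
{
  ibv_wc wc[MAX_COMPLETIONS];
  while (!done) {                              // 'done' shutdown flag is assumed
    // tx completions are cheap to process, so handle them right here
    int tx_ret = tx_cq->poll_cq(MAX_COMPLETIONS, wc);
    if (tx_ret > 0)
      handle_tx_event(wc, tx_ret);
    // rx completions are still handed off to the owning RDMAWorker
    int rx_ret = rx_cq->poll_cq(MAX_COMPLETIONS, wc);
    if (rx_ret > 0)
      dispatch_rx_events(wc, rx_ret);          // hypothetical helper: queue to workers
  }
}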

yuyuyu101 added some commits Feb 27, 2017

msg/async/rdma: change notify to notify_worker
Signed-off-by: Haomai Wang <haomai@xsky.com>
msg/async/rdma: remove unused variable
Signed-off-by: Haomai Wang <haomai@xsky.com>
msg/async/rdma: remove lock protection for pending_sent_conns
Signed-off-by: Haomai Wang <haomai@xsky.com>
@yuyuyu101

Member

yuyuyu101 commented Feb 28, 2017

@Adirl @orendu now the whole I/O flow is much clearer: the dispatcher thread handles the tx cq directly and dispatches the rx cq to the workers. Fewer locks, fewer hand-offs.

@yuyuyu101

Member

yuyuyu101 commented Feb 28, 2017

Still need to solve the failing test case.

while (!pending_sent_conns.empty()) {
  RDMAConnectedSocketImpl *o = pending_sent_conns.front();
  pending_sent_conns.pop_front();
  if (!done.count(o)) {
    lock.Unlock();
    done.insert(o);
    ssize_t r = o->submit(false);
    ldout(cct, 20) << __func__ << " sent pending bl socket=" << o << " r=" << r << dendl;

@Adirl

Adirl Feb 28, 2017

Looks like the false param to submit() has no effect?
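For reference, a minimal sketch of the lock-free shape the pending_sent_conns loop quoted above can take once it is only touched from its owning RDMAWorker thread (names are taken from the excerpt; the PR's final code may differ):

std::set<RDMAConnectedSocketImpl*> done;
while (!pending_sent_conns.empty()) {
  RDMAConnectedSocketImpl *o = pending_sent_conns.front();
  pending_sent_conns.pop_front();
  if (done.insert(o).second) {   // submit each socket at most once per pass
    ssize_t r = o->submit(false);
    ldout(cct, 20) << __func__ << " sent pending bl socket=" << o << " r=" << r << dendl;
  }
}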

@Adirl

Adirl approved these changes Feb 28, 2017

int tx_ret = tx_cq->poll_cq(MAX_COMPLETIONS, wc);
if (tx_ret > 0) {
  ldout(cct, 20) << __func__ << " pool tx completion queue got " << tx_ret
                 << " responses."<< dendl;

@Adirl

Adirl Feb 28, 2017

I guess you mean "poll".

int rx_ret = rx_cq->poll_cq(MAX_COMPLETIONS, wc);
if (rx_ret > 0) {
  ldout(cct, 20) << __func__ << " pool rt completion queue got " << rx_ret
                 << " responses."<< dendl;

@Adirl

Adirl Feb 28, 2017

I guess you mean "poll rx".
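For context, the poll_cq() calls in these excerpts presumably boil down to libibverbs' ibv_poll_cq(); a standalone sketch of that underlying call (the batch size value and helper name here are assumptions, not Ceph's wrapper API):

#include <infiniband/verbs.h>

static constexpr int MAX_COMPLETIONS = 16;   // batch size, value assumed

// Drains up to MAX_COMPLETIONS work completions from cq into wc.
// Returns how many were polled (0 if the CQ is empty, negative on error);
// each wc[i] carries the opcode, status, and wr_id of one completion.
int drain_cq(ibv_cq *cq, ibv_wc *wc)
{
  return ibv_poll_cq(cq, MAX_COMPLETIONS, wc);
}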

if (tx_ret > 0) {
  ldout(cct, 20) << __func__ << " pool tx completion queue got " << tx_ret
                 << " responses."<< dendl;
  handle_tx_event(wc, tx_ret);

@Adirl

Adirl Feb 28, 2017

Handle it here, on the same thread?

@Adirl

Adirl Feb 28, 2017

You are correct, it's OK to handle Tx here; I see you removed some code from post_tx_buffer() that I was worried about.

}
// FIXME: why not tx?
if (ib->get_memory_manager()->is_tx_buffer(chunk->buffer))

@Adirl

Adirl Feb 28, 2017

Maybe assert it's a Tx CQE, like you did for Rx:
assert(wc[i].opcode == IBV_WC_RECV);
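A possible Tx-side counterpart of that check (IBV_WC_SEND is the send-completion opcode in verbs; whether the PR actually adds this assertion is not shown here):

assert(wc[i].opcode == IBV_WC_SEND);   // tx completions should be send completions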

ib->get_memory_manager()->return_tx(chunks);
ldout(cct, 30) << __func__ << " release " << chunks.size()
               << " chunks, inflight " << inflight << dendl;
notify_pending_workers();

@Adirl

Adirl Feb 28, 2017

why is that needed?

@yuyuyu101

yuyuyu101 Mar 1, 2017

Member

Wake up a worker to handle pending messages that are waiting for tx buffers.
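A rough sketch of what that wake-up path can look like (pending_workers, num_pending_workers, and w_lock are assumed names here; notify_worker() is the call renamed in the first commit):

void RDMADispatcher::notify_pending_workers()
{
  // Once tx chunks are back in the pool, wake one worker that stalled
  // earlier because it could not reserve enough tx buffers.
  RDMAWorker *w = nullptr;
  {
    Mutex::Locker l(w_lock);
    if (!pending_workers.empty()) {
      w = pending_workers.front();
      pending_workers.pop_front();
      --num_pending_workers;
    }
  }
  if (w)
    w->notify_worker();   // resumes its pending sends
}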

ldout(cct, 30) << __func__ << " reserve " << r << " chunks, inflight " << dispatcher->inflight << dendl;
assert(r >= 0);
size_t got = infiniband->get_memory_manager()->get_tx_chunk_size() * r;
ldout(cct, 30) << __func__ << " need " << bytes << " bytes, reserve " << got << " bytes, inflight " << dispatcher->inflight << dendl;

@Adirl

Adirl Feb 28, 2017

Maybe:
ldout(cct, 30) << __func__ << " need " << bytes << " bytes, got " << got << " registered bytes, inflight " << dispatcher->inflight << dendl;

@Adirl

Adirl commented Feb 28, 2017

Attaching an osd log of this code running on a real cluster, FYI:

ceph-osd.0 (2).txt

@Adirl

Adirl commented Feb 28, 2017

@yuyuyu101
Found a few of these in the osd log:
2017-02-28 12:57:50.671156 7f2b1c928700 1 RDMAStack handle_tx_event not tx buffer, chunk 0x7f2b2fb74000

Shouldn't happen, right?

@@ -434,114 +452,49 @@ int RDMAWorker::connect(const entity_addr_t &addr, const SocketOptions &opts, Co
int RDMAWorker::reserve_message_buffer(RDMAConnectedSocketImpl *o, std::vector<Chunk*> &c, size_t bytes)

@Adirl

Adirl Feb 28, 2017

Can we call it "get_reged_mem()"?

@Adirl

Adirl commented Mar 1, 2017

@yuyuyu101
Is the last commit ("msg/async: must destruct ConnectedSocket in the his own thread …")
solving the issue I saw (RDMAStack handle_tx_event not tx buffer)?

yuyuyu101 added some commits Feb 27, 2017

msg/async/rdma: refactor RDMAStack to accelerate tx handle
Previously the Dispatcher thread polled both rx and tx events, then dispatched
these events to RDMAWorker and RDMAConnectedSocketImpl.

Tx event handling is actually a lightweight task, so it is now handled inline.
Rx event dispatching works as before.

Another change is adding a tx cq so that tx and rx event polling are separated.

This also removes a lot of code.

Signed-off-by: Haomai Wang <haomai@xsky.com>
msg/async: must destruct ConnectedSocket in the his own thread
Signed-off-by: Haomai Wang <haomai@xsky.com>
@yuyuyu101

Member

yuyuyu101 commented Mar 1, 2017

@Adirl Fixed now, please check again.

@Adirl

Adirl commented Mar 1, 2017

@yuyuyu101
looks ok so far

[root@r-dcs69 ~]# rados bench -p rbd 10 write --no-cleanup                                                                                         
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)                                 
    0       0         0         0         0         0           -           0                                 
    1      16        32        16   63.9953        64     0.88772    0.500009    
[root@r-dcs69 ~]#  rados bench -p rbd 10 rand
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      16       773       757   3027.35      3028  0.00602852   0.0198304
@Adirl

Adirl commented Mar 1, 2017

I see some memory missing at the worker:
"AsyncMessenger::RDMAWorker-1": {
"tx_no_mem": 8,
"tx_parital_mem": 7,
"tx_failed_post": 0,

@Adirl

Adirl commented Mar 1, 2017

./fio --ioengine=rbd --rw=rw --bs=1M --numjobs=1 --clientname=admin --pool=rbd --iodepth=1 --rbdname=fio_test --name=1 --time_based --runtime=20 --verify=crc32c --do_verify=1

  READ: bw=15.8MiB/s (16.5MB/s), 15.8MiB/s-15.8MiB/s (16.5MB/s-16.5MB/s), io=314MiB (329MB), run=20004-20004msec
  WRITE: bw=16.6MiB/s (17.3MB/s), 16.6MiB/s-16.6MiB/s (17.3MB/s-17.3MB/s), io=330MiB (346MB), run=20004-20004msec

Also, the counters are clean.
Hooray!

@yuyuyu101 yuyuyu101 merged commit 1a457be into ceph:master Mar 2, 2017

3 checks passed

Signed-off-by: all commits in this PR are signed
Unmodifed Submodules: submodules for project are unmodified
default: Build finished.

@yuyuyu101 yuyuyu101 deleted the yuyuyu101:wip-tx-zerocopy branch Mar 2, 2017
