
send-zerocopy and linked timeout: how to deal with the buffer on timeout? #643

Closed
beldzhang opened this issue Aug 25, 2022 · 34 comments

@beldzhang

We are preparing to upgrade our diskless server to use send-zerocopy.

The send SQE is linked with a timeout SQE to detect network problems; the linked timeout returns ETIME when something goes wrong. At that point, how should we deal with the buffer that was passed in by ...sendzc()?
Can it just be released as in a regular socket program, or is it still held (maybe?) somewhere in the kernel stack?

@isilence
Collaborator

> We are preparing to upgrade our diskless server to use send-zerocopy.

That's great, I'd love to hear about your use case. TCP? What is your usual payload size per send (e.g. io_uring request)?

Heads up: there is some ongoing discussion about the API, and the final version may still change (in the direction of simplification).

> The send SQE is linked with a timeout SQE to detect network problems; the linked timeout returns ETIME when something goes wrong. At that point, how should we deal with the buffer that was passed in by ...sendzc()? Can it just be released as in a regular socket program, or is it still held (maybe?) somewhere in the kernel stack?

close() is often not of much help, but let's say you call shutdown(SHUT_RDWR): then it's reasonable to assume that the socket will release the buffers and post the notifications within a bounded amount of time (assuming you flushed notifications and so on). As for whether you can reuse buffers immediately after shutdown() without waiting for the notification CQEs: that's not currently supported, i.e. you'd need to wait for the CQEs, but it's a good question for me to think about later.
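A minimal liburing sketch of that recovery path, under the assumption that userspace waits for the notification CQEs before reusing the buffers; the socket fd and the pending-notification counter are placeholders, not anything defined in this thread:

```c
#include <liburing.h>
#include <sys/socket.h>

/* On a linked-timeout ETIME for a zerocopy send, force the socket to give
 * the buffers back in bounded time; IORING_CQE_F_NOTIF CQEs will follow. */
static void on_sendzc_timeout(int sockfd)
{
    shutdown(sockfd, SHUT_RDWR);
}

/* Before freeing/reusing the send buffers, drain the outstanding
 * notification CQEs (one per zerocopy send that actually transmitted data). */
static void drain_notifications(struct io_uring *ring, unsigned pending)
{
    struct io_uring_cqe *cqe;

    while (pending && io_uring_wait_cqe(ring, &cqe) == 0) {
        if (cqe->flags & IORING_CQE_F_NOTIF)
            pending--;  /* buffer tied to cqe->user_data is free again */
        io_uring_cqe_seen(ring, cqe);
    }
}
```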

@beldzhang
Author

beldzhang commented Aug 31, 2022

> That's great, I'd love to hear about your use case. TCP? What is your usual payload size per send (e.g. io_uring request)?
> Heads up: there is some ongoing discussion about the API, and the final version may still change (in the direction of simplification).

It is over TCP; the payload size is requested by the client side, from 512 B to 256 KB.
We have already started upgrading the code; sendzc gives a noticeable reduction in CPU usage on my server.

> close() is often not of much help, but let's say you call shutdown(SHUT_RDWR) ...

Will try to simulate this after the code upgrade.

Another issue found: because splice consumes more CPU and sendzc looks like it resolves this, I redesigned the code to use fixed buffers. The linked SQEs are (a liburing sketch follows the list):

1) read_fixed(data)
2) sendzc(head + data)
3) link_timeout
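A minimal liburing sketch of this chain (error handling omitted; the fds, buffer, registered-buffer index and timeout value are placeholders, and the header is folded into the same buffer for brevity):

```c
#include <liburing.h>

static void queue_read_then_sendzc(struct io_uring *ring, int filefd, int sockfd,
                                   void *buf, unsigned len, __u64 file_off,
                                   int buf_index, struct __kernel_timespec *ts)
{
    struct io_uring_sqe *sqe;

    /* 1) read from the image / raw device into a registered buffer */
    sqe = io_uring_get_sqe(ring);
    io_uring_prep_read_fixed(sqe, filefd, buf, len, file_off, buf_index);
    sqe->flags |= IOSQE_IO_LINK;

    /* 2) zerocopy send of the (header + data) buffer */
    sqe = io_uring_get_sqe(ring);
    io_uring_prep_send_zc(sqe, sockfd, buf, len, 0, 0);
    sqe->flags |= IOSQE_IO_LINK;

    /* 3) linked timeout bounding the send */
    sqe = io_uring_get_sqe(ring);
    io_uring_prep_link_timeout(sqe, ts, 0);

    io_uring_submit(ring);
}
```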

When testing with a solid image file and a raw hard drive, SQE 2) returned ECANCELED after about 1k~2k requests were processed.
I already checked that 1) returns the right length, and it happens with or without 3).
It happens even when there is only one client connected.
Found on kernels 5.15.63, 5.19.5, and 6.0-rc3.
EDIT: SQE 2) also has the problem when using send() instead of sendzc(), and it occurs with non-fixed buffers as well.

BTW, I know sendfile() is not welcomed, but compared with a thread-mode server, sendfile() has the lowest CPU usage; in user mode there is no need to care about a pipe, and it is useful for many kinds of use cases (e.g. an HTTP server).
So is there any possibility of adding this, or something like it?

@beldzhang
Author

kernel....: 5.19.5
liburing..: 2.3-git
Tested with READ and READ_FIXED; it's the same.
After changing IOSQE_IO_LINK to IOSQE_IO_HARDLINK, this does NOT happen.

Each ECANCELED causes the following two SQEs, SEND and TIMEOUT, to fail.

[DEVEL-10:~/.../kernel/dell]# perf trace -e io_uring:io_uring_fail_link --call-graph fp
     0.000 iou-0/24077 io_uring:io_uring_fail_link(ctx: 0xffff8883a103c800, req: 0xffff88810ccef600, user_data: 140370983686400, opcode: 22, link: 0xffff88810cceff00, op_str: "READ")
                                       io_disarm_next ([kernel.kallsyms])
                                       io_disarm_next ([kernel.kallsyms])
                                       io_free_batch_list ([kernel.kallsyms])
                                       __io_submit_flush_completions ([kernel.kallsyms])
                                       tctx_task_work ([kernel.kallsyms])
                                       task_work_run ([kernel.kallsyms])
                                       __do_sys_io_uring_enter ([kernel.kallsyms])
                                       do_syscall_64 ([kernel.kallsyms])
                                       entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
                                       DEBUGEND_$NZ_IORING (/root/work/devel/2/common/tools/benckmark/idrive6)
                                       [0] ([unknown])
                                       [0] ([unknown])
     0.008 iou-0/24077 io_uring:io_uring_fail_link(ctx: 0xffff8883a103c800, req: 0xffff88810ccef600, user_data: 140370983686400, opcode: 22, link: 0xffff88810ccee400, op_str: "READ")
                                       io_disarm_next ([kernel.kallsyms])
                                       io_disarm_next ([kernel.kallsyms])
                                       io_free_batch_list ([kernel.kallsyms])
                                       __io_submit_flush_completions ([kernel.kallsyms])
                                       tctx_task_work ([kernel.kallsyms])
                                       task_work_run ([kernel.kallsyms])
                                       __do_sys_io_uring_enter ([kernel.kallsyms])
                                       do_syscall_64 ([kernel.kallsyms])
                                       entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
                                       DEBUGEND_$NZ_IORING (/root/work/devel/2/common/tools/benckmark/idrive6)
                                       [0] ([unknown])
                                       [0] ([unknown])

@beldzhang
Author

I found that the notification slot was removed from both the kernel and liburing.
After checking up to 42ee2bd, there is no flag to check whether send_zc is supported by the currently running kernel.
Is there any plan to add such a flag, or should we just check for kernel version >= 6.0?

@axboe
Owner

axboe commented Sep 3, 2022

Never rely on a kernel version check. Either just try the command and see if it fails, or use io_uring_get_probe() and check whether that opcode is marked as supported.
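For reference, a small sketch of the probe approach with liburing (IORING_OP_SEND_ZC is the zerocopy send opcode; the helper name is made up here):

```c
#include <liburing.h>
#include <stdbool.h>

static bool send_zc_supported(void)
{
    struct io_uring_probe *probe = io_uring_get_probe();
    bool supported = probe && io_uring_opcode_supported(probe, IORING_OP_SEND_ZC);

    if (probe)
        io_uring_free_probe(probe);
    return supported;
}
```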

@isilence
Collaborator

isilence commented Sep 5, 2022

> I found that the notification slot was removed from both the kernel and liburing. After checking up to 42ee2bd, there is no flag to check whether send_zc is supported by the currently running kernel. Is there any plan to add such a flag, or should we just check for kernel version >= 6.0?

Agree with Jens on the testing side.

The API should be easier now. If it sends anything, io_uring posts 2 CQEs. The first one is the usual request completion with IORING_CQE_F_MORE set, and the second one is the "buffer-free" notification with the same cqe->user_data and IORING_CQE_F_NOTIF set. If it fails (i.e. cqe->res < 0), there will be only one CQE, without IORING_CQE_F_MORE. So you can look at IORING_CQE_F_MORE to figure out whether there will be one or two CQEs. I'll write a man page, I think this week.
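A sketch of how userspace might classify those CQEs; the buffer-pool handling hinted at in the comments is application-specific and not part of the API:

```c
#include <liburing.h>
#include <stdio.h>

static void handle_send_zc_cqe(struct io_uring_cqe *cqe)
{
    if (cqe->flags & IORING_CQE_F_NOTIF) {
        /* Second CQE: the kernel dropped its reference, the buffer keyed
         * by cqe->user_data may now be reused or freed. */
        return;
    }
    if (cqe->res < 0) {
        /* Failed send: F_MORE is not set, no notification will follow. */
        fprintf(stderr, "send_zc failed: %d\n", cqe->res);
        return;
    }
    /* First CQE: cqe->res is the number of bytes sent; if F_MORE is set,
     * a notification CQE with the same user_data arrives later. */
    printf("sent %d bytes, notification pending: %d\n",
           cqe->res, !!(cqe->flags & IORING_CQE_F_MORE));
}
```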

@isilence
Collaborator

isilence commented Sep 5, 2022

> That's great, I'd love to hear about your use case. TCP? What is your usual payload size per send (e.g. io_uring request)?
> Heads up: there is some ongoing discussion about the API, and the final version may still change (in the direction of simplification).

> It is over TCP; the payload size is requested by the client side, from 512 B to 256 KB. We have already started upgrading the code; sendzc gives a noticeable reduction in CPU usage on my server.

Nice!

> BTW, I know sendfile() is not welcomed, but compared with a thread-mode server, sendfile() has the lowest CPU usage; in user mode there is no need to care about a pipe, and it is useful for many kinds of use cases (e.g. an HTTP server). So is there any possibility of adding this, or something like it?

I don't have a strong opinion on that one; it could be implemented, but it would definitely require more changes than just adding a sendfile io_uring opcode.

@isilence
Collaborator

isilence commented Sep 5, 2022

> Another issue found: because splice consumes more CPU and sendzc looks like it resolves this, I redesigned the code to use fixed buffers. The linked SQEs are:

Fixed buffers are faster, but fwiw you can also use normal buffers with sendzc.

> 1) read_fixed(data)
> 2) sendzc(head + data)
> 3) link_timeout

Just a general thought on performance, which might not matter for your case but still: linked timeouts are not the fastest thing, so I'd experiment and see if there is a noticeable speedup from not having them. One idea for optimising it is to have a per-ring timer / timeout request (or probably a couple of them), and if/when they fire you issue a bunch of IORING_OP_ASYNC_CANCEL requests (a rough sketch follows).
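A rough sketch of that idea, assuming liburing; the timer tag and the policy for deciding which user_data values are stale are placeholders:

```c
#include <liburing.h>

#define TIMER_TAG ((__u64)-1)   /* hypothetical marker for the per-ring timer */

/* Arm one timeout request for the whole ring instead of a linked timeout
 * per send; re-arm it from the completion handler when it fires. */
static void arm_ring_timer(struct io_uring *ring, struct __kernel_timespec *ts)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    io_uring_prep_timeout(sqe, ts, 0, 0);
    io_uring_sqe_set_data64(sqe, TIMER_TAG);
}

/* When the timer fires, cancel whatever requests the application deems stuck. */
static void cancel_stale(struct io_uring *ring, const __u64 *stale_tags, int nr)
{
    for (int i = 0; i < nr; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        io_uring_prep_cancel64(sqe, stale_tags[i], 0);
    }
}
```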

> When testing with a solid image file and a raw hard drive, SQE 2) returned ECANCELED after about 1k~2k requests were processed. I already checked that 1) returns the right length, and it happens with or without 3). It happens even when there is only one client connected. Found on kernels 5.15.63, 5.19.5, and 6.0-rc3. EDIT: SQE 2) also has the problem when using send() instead of sendzc(), and it occurs with non-fixed buffers as well.

I'm not entirely sure I get it. What do you mean by "happens with or without 3)"? Or maybe you have a simple reproducer?
It should always post 3 CQEs per such link unless you use IORING_FEAT_CQE_SKIP.

@beldzhang
Author

Changed to io_uring_get_probe() to check sendzc support; thanks to Jens.
Will test the new API with 6.0-rc4 this week, and will check whether there is any performance loss with the linked timeout.

--

For the abnormal ECANCELED, I uploaded the testing program.

> I'm not entirely sure I get it. What do you mean by "happens with or without 3)"? Or maybe you have a simple reproducer?

This happens when submitting 2 SQEs (read/send) or 3 SQEs (read/send/timeout).
Output of the testing server:

[DEVEL-10:~/.../uring/tests]# make s && ./server /dev/sda
gcc -g server.c -o server -l uring
 ++ image size = 250059350016
STARTING IO_URING....
 ++ client-accepted: 6
     !! unexpected ECANCELED at   1 / 12413 !!
        resp-data: err=-125

In server.c, set sending_timeout to 1 to include the timeout, 0 to exclude it. Tested on 5.19.5.

issue-643-v1.zip

@beldzhang
Author

> The first one is the usual request completion with IORING_CQE_F_MORE set

In this one, cqe.res carries the right number of bytes sent.

> and the second one is the "buffer-free" notification with the same cqe->user_data and IORING_CQE_F_NOTIF

In this one, cqe.res is 0, so userspace must keep the sent count from the first CQE to handle the protocol data.

Is this the designed behavior?

@isilence
Collaborator

isilence commented Sep 6, 2022

> > The first one is the usual request completion with IORING_CQE_F_MORE set
>
> In this one, cqe.res carries the right number of bytes sent.
>
> > and the second one is the "buffer-free" notification with the same cqe->user_data and IORING_CQE_F_NOTIF
>
> In this one, cqe.res is 0,
>
> Is this the designed behavior?

Yes, that's by design; it can be changed if there is a good reason for that.

> so userspace must keep the sent count from the first CQE to handle the protocol data.

Not sure why you'd need the result at that point. The user should not modify data in the buffer until it gets the notification, but it is still allowed to read it, including with syscalls like write / send. The usual scenario on getting a notification is userspace freeing / reusing the buffer, and I expect the user already knows everything it needs about the buffer, like its length.

@beldzhang
Author

beldzhang commented Sep 6, 2022

Oh... sorry for not describing it clearly.
What I wanted to describe is the protocol handling, within one request-response:

recv-protocol-head
if (cqe.res == head_length)
    send-head-and-data
    if (cqe.res == required_length)
        next_loop

After changing send-head-and-data to zerocopy, the cqe.res from the F_NOTIF CQE is zero and cannot be used to check the amount actually sent, so saving the cqe.res from the F_MORE CQE is required.

After more consideration, this looks like the best way to handle sendzc for now. +1.

@isilence
Collaborator

isilence commented Sep 8, 2022

> For the abnormal ECANCELED, I uploaded the testing program.
>
> This happens when submitting 2 SQEs (read/send) or 3 SQEs (read/send/timeout). Output of the testing server:
>
> [DEVEL-10:~/.../uring/tests]# make s && ./server /dev/sda
> gcc -g server.c -o server -l uring
>  ++ image size = 250059350016
> STARTING IO_URING....
>  ++ client-accepted: 6
>      !! unexpected ECANCELED at   1 / 12413 !!
>         resp-data: err=-125
>
> In server.c, set sending_timeout to 1 to include the timeout, 0 to exclude it. Tested on 5.19.5.
>
> issue-643-v1.zip

I tried it out, but it doesn't reproduce for me, and I don't immediately see anything wrong.

@isilence
Collaborator

isilence commented Sep 8, 2022

BTW, did you use O_DIRECT on the block device file when you were benchmarking zc send vs sendfile?

@beldzhang
Author

> I tried it out, but it doesn't reproduce for me, and I don't immediately see anything wrong.

My sda is a USB SSD, and I re-tested with an NVMe disk on 6.0-rc4; it's the same.
Oh, and all of this testing is on a raw disk WITHOUT any mount point on it.

My system is Slackware 15.0 with a customized kernel; might that be related?

> BTW, did you use O_DIRECT on the block device file when you were benchmarking zc send vs sendfile?

No, I never open any image or disk with O_DIRECT.

@isilence
Collaborator

isilence commented Sep 9, 2022

> > I tried it out, but it doesn't reproduce for me, and I don't immediately see anything wrong.
>
> My sda is a USB SSD, and I re-tested with an NVMe disk on 6.0-rc4; it's the same. Oh, and all of this testing is on a raw disk WITHOUT any mount point on it.
>
> My system is Slackware 15.0 with a customized kernel; might that be related?

Finally caught the bug and I see where it came from. I'd like to add your Reported-by tag to the fix commit, so it's in the history that you found the bug and helped the community, but I'd need your full name and email. It looks like:

Reported-by: Pavel Begunkov <my_email@example.com>

@beldzhang
Author

Very happy to help the community.
My name: Beld Zhang
E-mail: beldzhang@gmail.com

Thanks to all for the hard work!

@axboe
Owner

axboe commented Sep 9, 2022

Thanks, I've updated the commit and also included a link to this issue.

ammarfaizi2 pushed a commit to ammarfaizi2/linux-fork that referenced this issue Sep 9, 2022
We have a couple of problems, first reports of unexpected link breakage
for reads when cqe->res indicates that the IO was done in full. The
reason here is partial IO with retries.

TL;DR; we compare the result in __io_complete_rw_common() against
req->cqe.res, but req->cqe.res doesn't store the full length but rather
the length left to be done. So, when we pass the full corrected result
via kiocb_done() -> __io_complete_rw_common(), it fails.

The second problem is that we don't try to correct res in
io_complete_rw(), which, for instance, might be a problem for O_DIRECT
but when a prefix of data was cached in the page cache. We also
definitely don't want to pass a corrected result into io_rw_done().

The fix here is to leave __io_complete_rw_common() alone, always pass
not corrected result into it and fix it up as the last step just before
actually finishing the I/O.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: axboe/liburing#643
Reported-by: Beld Zhang <beldzhang@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
@beldzhang
Author

> Thanks, I've updated the commit and also included a link to this issue.

Patched 4d9cb92 onto 6.0-rc4; my testing is OK now.

Next I will check/test the error handling that this issue started with.

@isilence
Collaborator

@beldzhang, is everything going all right? It looks like you've deleted a message. Anyway, let me drop a couple of notes here:

  1. It'll be slower for small per-request payloads. So, for TCP, if you're sending 1MB per request, you have nothing to worry about, but at 4KB it might be slower. If the payload size is not fixed, I'd recommend mixing normal and zc sends depending on the payload size (see the sketch after this list). I'm not sure where the tipping point is, but you can try to find it experimentally.

  2. I have a branch with zc sendmsg and iovec support; I will be sending it for 6.1. Tell me if it would help you to increase the payload size.

  3. If you have problems, send me perf profiles, one for the setup with normal sends and one for zc sends, and I'll take a look.

  4. The reason I asked about O_DIRECT for reads is that without it there is an extra copy through the page cache, which is not great. Just FWIW.

  5. Do you utilise 100% of a CPU / core? You may want to pin the process to one core.
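For point 1, a tiny sketch of mixing the two send variants by payload size; the 16 KB threshold is an arbitrary placeholder to be tuned experimentally:

```c
#include <liburing.h>

#define ZC_THRESHOLD (16 * 1024)  /* hypothetical tipping point */

static void prep_send_auto(struct io_uring_sqe *sqe, int sockfd,
                           const void *buf, size_t len)
{
    if (len >= ZC_THRESHOLD)
        io_uring_prep_send_zc(sqe, sockfd, buf, len, 0, 0);  /* big payload */
    else
        io_uring_prep_send(sqe, sockfd, buf, len, 0);        /* small payload */
}
```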

@isilence
Collaborator

It might also be a latency problem. Don't forget that the second CQE will likely come with a good delay after the first one, and you absolutely don't want to wait for it before sending more I/O. In other words, don't strictly keep:

  1. submit send
  2. wait send complete
  3. wait for buffer complete
  4. goto 1)

but rather something more like (a sketch follows the list):

  1. submit send
  2. wait send complete
  3. goto 1
  4. (step 4.5, asynchronous) wait for buffer complete
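A sketch of that second model as a completion loop: new sends are driven off the send completion, and the F_NOTIF CQE only recycles buffers. The submit-next-send step is left as a comment since it is application-specific:

```c
#include <liburing.h>

static void completion_loop(struct io_uring *ring)
{
    struct io_uring_cqe *cqe;

    while (io_uring_wait_cqe(ring, &cqe) == 0) {
        if (cqe->flags & IORING_CQE_F_NOTIF) {
            /* Late buffer notification: return the buffer to the pool,
             * but never gate the next send on this event. */
        } else {
            /* Send completed (or failed): queue the next send right away,
             * e.g. prep another read + send_zc chain here. */
        }
        io_uring_cqe_seen(ring, cqe);
        io_uring_submit(ring);
    }
}
```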

@beldzhang
Author

> It might also be a latency problem...

Yes, it is the latency... I posted that message too hastily, before figuring out my hardware environment. I am using a USB Ethernet adapter that has a large TX timer; previous testing was not affected by it, but sendzc is.

> In other words, don't strictly keep: 1. submit send, 2. wait send complete, 3. wait for buffer complete, 4. goto 1)

Sadly, the current code is like this: step 3 spends a long time waiting and causes the speed drop. I will try to use the second model in the next development cycle.

> 1. It'll be slower for small per-request payloads...
> 2. Why I asked about O_DIRECT for reads...

Will test next week.

Already tested the error handling and it all behaves as expected. Looks like we could close this issue successfully soon.

Thanks a lot for the notes and advice.

@beldzhang
Author

beldzhang commented Sep 12, 2022

  1. Error handling passes.

  2. sendzc has lower CPU usage even when synchronously waiting for the buffer-complete notification.

  3. Asynchronous waiting for the buffer-complete notification is postponed, because it requires a buffer-management redesign.

  4. Another short recv was found in 5.15.67, causing a recv-write link to break; it is not seen on 5.19 and 6.0-rc.
    In ...prep_recv() the MSG_WAITALL flag is set, but a partial recv still happens, and then the following write SQE is canceled.
    Is it known? Any plan to fix it or not?

  5. In liburing.h, ...prep_sendzc_fixed was removed; is it possible to add it back?

@beldzhang
Author

beldzhang commented Sep 15, 2022

> Finally caught the bug and I see where it came from.

The unexpected ECANCELED happened on the testing server with 2 CPUs, doing 256 KB random reads linked with sends.
nvme0n1 has no mountpoint.

[nzel-server:~/w/devel/2/test/uring-inactive/tests]# uname -r
6.0.0-rc5-64-6
[nzel-server:~/w/devel/2/test/uring-inactive/tests]# ./server /dev/nvme0n1
 ++ image size = 2048408248320
STARTING IO_URING....
 ++ client-accepted: 6
     !! unexpected ECANCELED at   1 /    51 !!
        resp-data: err=-125

Testing program attached: issue-643-v2.zip
It also includes cli_write.c, for:

> 1. Another short recv was found in 5.15.67, causing a recv-write link to break; it is not seen on 5.19 and 6.0-rc.
[nzel-server:~/w/devel/2/test/uring-inactive/tests]# uname -r
5.15.64-64-6
[nzel-server:~/w/devel/2/test/uring-inactive/tests]# ./server /dev/nvme0n1
 ++ image size = 2048408248320
STARTING IO_URING....
 ++ client-accepted: 6
 !! partial recv data:  65450, require: 262144
 !! partial recv data: 163705, require: 196694
 !! partial recv data:  32741, require:  32989
 !! partial recv data: 229344, require: 262144
...

Update:
Patched and tested 62bb0647.
Hope this is useful.

[nzel-server:~/w/devel/2/test/uring-inactive/tests]# ./server /dev/nvme0n1
 ++ image size = 2048408248320
STARTING IO_URING....
 ++ client-accepted: 6
     !! unexpected ECANCELED at   1 / 16372 !!
        resp-data: err=-125
[nzel-server:~]# perf trace -e io_uring:io_uring_fail_link --call-graph fp
     0.000 server/1620 io_uring:io_uring_fail_link(ctx: 0xffff888111027800, req: 0xffff888104098e00, user_data: 3, opcode: 22, link: 0xffff888104072400, op_str: "READ")
                                       io_disarm_next ([kernel.kallsyms])
                                       io_disarm_next ([kernel.kallsyms])
                                       io_free_batch_list ([kernel.kallsyms])
                                       __io_submit_flush_completions ([kernel.kallsyms])
                                       ctx_flush_and_put ([kernel.kallsyms])              <-- this is different from the previous trace
                                       tctx_task_work ([kernel.kallsyms])
                                       task_work_run ([kernel.kallsyms])
                                       io_run_task_work_sig ([kernel.kallsyms])
                                       __do_sys_io_uring_enter ([kernel.kallsyms])
                                       do_syscall_64 ([kernel.kallsyms])
                                       entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
                                       syscall (/lib64/libc-2.33.so)
                                       [0] ([unknown])
                                       [0] ([unknown])
     0.018 server/1620 io_uring:io_uring_fail_link(ctx: 0xffff888111027800, req: 0xffff888104098e00, user_data: 3, opcode: 22, link: 0xffff888104098a00, op_str: "READ")
                                       io_disarm_next ([kernel.kallsyms])
                                       io_disarm_next ([kernel.kallsyms])
                                       io_free_batch_list ([kernel.kallsyms])
                                       __io_submit_flush_completions ([kernel.kallsyms])
                                       ctx_flush_and_put ([kernel.kallsyms])
                                       tctx_task_work ([kernel.kallsyms])
                                       task_work_run ([kernel.kallsyms])
                                       io_run_task_work_sig ([kernel.kallsyms])
                                       __do_sys_io_uring_enter ([kernel.kallsyms])
                                       do_syscall_64 ([kernel.kallsyms])
                                       entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
                                       syscall (/lib64/libc-2.33.so)
                                       [0] ([unknown])
                                       [0] ([unknown])

@beldzhang
Author

beldzhang commented Sep 18, 2022

Finally found it.
In io_read(), if there is no retry, currently only the last ret of io_iter_do_read() is passed to kiocb_done(), which then fails in __io_complete_rw_common() because ret != cqe.res.
It should be the total amount that all io_iter_do_read() calls returned, and io->bytes_done should only be increased when there is a retry.

I tried to fix it but could not get it done (I don't know how to deal with the retry path).

After adding some logging, I found the following:
a. I don't understand why cqe.res becomes smaller.
b. When res in __io_complete_rw_common() is shortened, this req is marked failed, and final_ret is not the correct value either.

[nzel-server:~/uring-test]# ./server /dev/nvme0n1
 ++ image size = 2048408248320
STARTING IO_URING....
 ++ client-accepted: 6
     !! unexpected ECANCELED at   1 /    16 !!
        resp-data: err=-125
[  239.661253] ++io_read:  cqe.res(262144)
[  239.700637] ++io_read:  cqe.res(262144)
[  239.700806] ++__io_complete_rw_common: res=262144, cqe.res=262144
[  239.701070] ++io_read:  cqe.res(262144)
[  239.701320] ++io_read:  cqe.res(262144)
[  239.701431] ++io_read:  cqe.res(  3072)
[  239.701433] ++__io_complete_rw_common: res=  3072, cqe.res=  3072
[  239.701586] ++io_read:  cqe.res(262144)
[  239.701848] ++io_read:  cqe.res(262144)
[  239.701952] ++__io_complete_rw_common: res=262144, cqe.res=262144
[  239.702135] ++io_read:  cqe.res(262144)
[  239.702323] ++io_read:  cqe.res(262144)
[  239.702436] ++io_read:  cqe.res(  1536)
[  239.702439] ++__io_complete_rw_common: res=  1536, cqe.res=  1536
[  239.702565] ++io_read:  cqe.res(262144)
[  239.702788] ++io_read:  cqe.res(262144)
[  239.702897] ++__io_complete_rw_common: res=262144, cqe.res=262144
[  239.703037] ++io_read:  cqe.res(262144)
[  239.703297] ++io_read:  cqe.res(262144)
[  239.703395] ++__io_complete_rw_common: res=262144, cqe.res=262144
[  239.703551] ++io_read:  cqe.res(262144)
[  239.703768] ++io_read:  cqe.res(262144)
[  239.703878] ++io_read:  cqe.res(  1024)
[  239.703880] ++__io_complete_rw_common: res=  1024, cqe.res=  1024
[  239.703993] ++io_read:  cqe.res(262144)
[  239.704192] ++io_read:  cqe.res(262144)
[  239.704312] ++io_read:  cqe.res(  2048)
[  239.704315] ++__io_complete_rw_common: res=  2048, cqe.res=  2048
[  239.704426] ++io_read:  cqe.res(262144)
[  239.704623] ++io_read:  cqe.res(262144)
[  239.704752] ++__io_complete_rw_common: res=262144, cqe.res=262144
[  239.704867] ++io_read:  cqe.res(262144)
[  239.705065] ++io_read:  cqe.res(262144)
[  239.705181] ++io_read:  cqe.res(  2560)
[  239.705183] ++__io_complete_rw_common: res=  2560, cqe.res=  2560
[  239.705291] ++io_read:  cqe.res(262144)
[  239.705476] ++io_read:  cqe.res(262144)
[  239.705591] ++io_read:  cqe.res(  1024)
[  239.705593] ++__io_complete_rw_common: res=  1024, cqe.res=  1024
[  239.705695] ++io_read:  cqe.res(262144)
[  239.705905] ++io_read:  cqe.res(262144)
[  239.706030] ++io_read:  cqe.res(  1536)
[  239.706033] ++__io_complete_rw_common: res=  1536, cqe.res=  1536
[  239.706109] ++io_read:  cqe.res(262144)
[  239.706290] ++io_read:  cqe.res(262144)
[  239.706391] ++io_read:  cqe.res(  1024)
[  239.706393] ++__io_complete_rw_common: res=  1024, cqe.res=  1024
[  239.706472] ++io_read:  cqe.res(262144)
[  239.706648] ++io_read:  cqe.res(262144)
[  239.706747] ++io_read:  cqe.res(  1536)
[  239.706749] ++__io_complete_rw_common: res=  1536, cqe.res=  1536
[  239.706829] ++io_read:  cqe.res(262144)
[  239.707013] ++io_read:  cqe.res(262144)
[  239.707111] ++io_read:  cqe.res(  1536)
[  239.707113] ++__io_complete_rw_common: res=  1536, cqe.res=  1536
[  239.707194] ++io_read:  cqe.res(262144)
[  239.707386] ++io_read:  cqe.res(262144)
[  239.707504] ++__io_complete_rw_common: res=  3072, cqe.res=262144
[  239.707505]  !! __io_complete_rw_common(): res(  3072) != cqe.res(262144)
[  239.707507]    ++ req_set_fail: op=22, flags=40200004
[  239.707508] CPU: 31 PID: 1457 Comm: server Not tainted 6.0.0-rc5-64-6b #21
[  239.707510] Hardware name: Dell Inc. Precision T5600/000000, BIOS A19 06/30/2019
[  239.707512] Call Trace:
[  239.707513]  <TASK>
[  239.707515]  dump_stack_lvl+0x34/0x44
[  239.707518]  __io_complete_rw_common+0xac/0xd0
[  239.707521]  kiocb_done.cold+0xc/0x3c
[  239.707524]  io_read.cold+0x10f/0x2cf
[  239.707526]  ? __kmalloc+0x64/0x4b0
[  239.707529]  ? update_load_avg+0x7e/0x740
[  239.707532]  ? newidle_balance+0x2ba/0x3f0
[  239.707534]  ? io_writev_prep_async+0x80/0x80
[  239.707537]  io_issue_sqe+0x60/0x410
[  239.707540]  ? _raw_spin_unlock+0x12/0x30
[  239.707542]  ? finish_task_switch.isra.0+0xa3/0x2b0
[  239.707545]  io_req_task_submit+0x2a/0x70
[  239.707548]  handle_tw_list+0x9a/0xd0
[  239.707551]  tctx_task_work+0x4e/0x120
[  239.707554]  task_work_run+0x59/0x90
[  239.707557]  io_run_task_work_sig+0x43/0x60
[  239.707559]  __do_sys_io_uring_enter+0x2a2/0x8a0
[  239.707561]  ? percpu_ref_get_many+0x30/0x30
[  239.707564]  do_syscall_64+0x3d/0x90
[  239.707566]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  239.707568] RIP: 0033:0x7f3662bb1c09
[  239.707571] Code: 48 8d 3d 6a 1f 0d 00 0f 05 eb a5 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2f e2 0c 00 f7 d8 64 89 01 48
[  239.707573] RSP: 002b:00007fff43f32918 EFLAGS: 00000216 ORIG_RAX: 00000000000001aa
[  239.707575] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3662bb1c09
[  239.707577] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000005
[  239.707578] RBP: 0000000000404100 R08: 0000000000000000 R09: 0000000000000008
[  239.707579] R10: 0000000000000001 R11: 0000000000000216 R12: 00007fff43f32970
[  239.707580] R13: 00007fff43f329b8 R14: 0000000000000000 R15: 0000000000000000
[  239.707582]  </TASK>

@beldzhang
Author

OK, here is a rough, wild-west concept patch that fixes this abnormal ECANCELED, based on 6.0-rc6:

--- rw-orig.c	2022-09-18 16:44:14.000000000 -0400
+++ rw.c	2022-09-18 19:53:58.725235656 -0400
@@ -706,7 +706,7 @@
 	struct kiocb *kiocb = &rw->kiocb;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
 	struct io_async_rw *io;
-	ssize_t ret, ret2;
+	ssize_t ret, ret2, ret_all = 0;
 	loff_t *ppos;
 
 	if (!req_has_async_data(req)) {
@@ -763,6 +763,7 @@
 	}
 
 	ret = io_iter_do_read(rw, &s->iter);
+	if (ret > 0) ret_all += ret;
 
 	if (ret == -EAGAIN || (req->flags & REQ_F_REISSUE)) {
 		req->flags &= ~REQ_F_REISSUE;
@@ -814,7 +815,7 @@
 		iov_iter_advance(&s->iter, ret);
 		if (!iov_iter_count(&s->iter))
 			break;
-		io->bytes_done += ret;
+//		io->bytes_done += ret;
 		iov_iter_save_state(&s->iter, &s->iter_state);
 
 		/* if we can retry, do so with the callbacks armed */
@@ -830,8 +831,11 @@
 		 * here, and if we do, then just retry at the new offset.
 		 */
 		ret = io_iter_do_read(rw, &s->iter);
-		if (ret == -EIOCBQUEUED)
+		if (ret > 0) ret_all += ret;
+		if (ret == -EIOCBQUEUED) {
+			io->bytes_done += ret_all;
 			return IOU_ISSUE_SKIP_COMPLETE;
+		}
 		/* we got some bytes, but not all. retry. */
 		kiocb->ki_flags &= ~IOCB_WAITQ;
 		iov_iter_restore(&s->iter, &s->iter_state);
@@ -840,7 +844,7 @@
 	/* it's faster to check here then delegate to kfree */
 	if (iovec)
 		kfree(iovec);
-	return kiocb_done(req, ret, issue_flags);
+	return kiocb_done(req, ret_all, issue_flags);
 }
 
 int io_write(struct io_kiocb *req, unsigned int issue_flags)

@isilence
Collaborator

> 2. sendzc has lower CPU usage even when synchronously waiting for the buffer-complete notification.

Great

> 3. Asynchronous waiting for the buffer-complete notification is postponed, because it requires a buffer-management redesign.
>
> 4. Another short recv was found in 5.15.67, causing a recv-write link to break; it is not seen on 5.19 and 6.0-rc.
>    In `...prep_recv()` the `MSG_WAITALL` flag is set, but a partial recv still happens, and then the following write SQE is canceled.
>    Is it known? Any plan to fix it or not?

The short-read fix hasn't been ported to older kernels yet. Are you using a stable kernel?

> 5. In liburing.h, `...prep_sendzc_fixed` was removed; is it possible to add it back?

Yeah, I can add it back.

@beldzhang
Author

oh, there are two different issues:

> 1. Another short recv was found in 5.15.67, causing a recv-write link to break; it is not seen on 5.19 and 6.0-rc ...

This was found recently, and after comparing the sources between these versions, it looks difficult to backport due to the huge changes, so please ignore it.


> The short-read fix hasn't been ported to older kernels yet.

> The unexpected ECANCELED happened on the testing server with 2 CPUs, doing 256 KB random reads linked with sends.

The short read is not completely fixed yet; I posted some messages starting from here: #643 (comment)
Testing code is already included: issue-643-v2.zip

In my debugging, I found that bytes_done and req->cqe.res lose sync when there are retries ...

> Are you using a stable kernel?

Currently we are running 5.15.x and preparing to upgrade to 6.0 or 6.1.

@isilence
Collaborator

Ah, I see. As for backports, I do have them; I just haven't sent them out yet, as I was waiting for another patch they depend on to land upstream.

@isilence
Collaborator

isilence commented Sep 26, 2022

> Finally found it. In io_read(), if there is no retry, currently only the last ret of io_iter_do_read() is passed to kiocb_done(), which then fails in __io_complete_rw_common() because ret != cqe.res. It should be the total amount that all io_iter_do_read() calls returned, and io->bytes_done should only be increased when there is a retry.

cqe.res stores the number of bytes we're currently trying to do, so how about this?

diff --git a/io_uring/rw.c b/io_uring/rw.c
index 76ebcfebc9a6..c562203d7a67 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -823,6 +823,7 @@ int io_read(struct io_kiocb *req, unsigned int issue_flags)
 			return -EAGAIN;
 		}
 
+		req->cqe.res = iov_iter_count(&s->iter);
 		/*
 		 * Now retry read with the IOCB_WAITQ parts set in the iocb. If
 		 * we get -EIOCBQUEUED, then we'll get a notification when the

I also pushed it to a branch with an explanation, for convenience.

isilence/linux@9a70c23

edit: branch
https://github.com/isilence/linux.git rw-link-break

@beldzhang
Author

> cqe.res stores the number of bytes we're currently trying to do, so how about this?

Testing passed: ~60 GB read at a 256 KB block size, three times, patched onto 6.0-rc7.

BTW, since I never hit issues with io_write, I think that part is all good.

@isilence
Collaborator

Perfect, can I add your tested-by tag?

@beldzhang
Author

> Perfect, can I add your tested-by tag?

Yes, I'd be very glad.

@isilence
Collaborator

Let's close this issue; it has long outgrown its initial purpose. You can open a new one if you hit more problems. And I'll be quite curious to hear about the results you get out of it in the future.

ollie-etl pushed a commit to etlsystems/linux that referenced this issue Oct 3, 2022