
send-zerocopy and linked timeout: how to deal with the buffer on timeout? #643

Closed
beldzhang opened this issue Aug 25, 2022 · 34 comments

@beldzhang

We are preparing to upgrade our diskless server to use send-zerocopy.

The send SQE is linked with a timeout SQE to detect network problems; the linked timeout returns ETIME when something goes wrong. At that point, how should we deal with the buffer that was passed in by ...sendzc()?
Can it just be released as in a regular socket program, or is it still held (maybe?) somewhere in the kernel stack?

@isilence
Collaborator

> We are preparing to upgrade our diskless server to use send-zerocopy.

That's great, I'd love to hear about your use case. TCP? What is your usual payload size per send (e.g. io_uring request)?

Heads up: there is some ongoing discussion about the API, and the final version may still change (in the direction of simplification).

> The send SQE is linked with a timeout SQE to detect network problems; the linked timeout returns ETIME when something goes wrong. At that point, how should we deal with the buffer that was passed in by ...sendzc()? Can it just be released as in a regular socket program, or is it still held (maybe?) somewhere in the kernel stack?

close() is often not of much help, but let's say you call shutdown(SHUT_RDWR): then it's reasonable to assume that the socket will release the buffers and post the notifications within a bounded amount of time (assuming you flushed notifications and so on). As for whether you can reuse buffers immediately after shutdown() without waiting for the notification CQEs: that's not currently supported, i.e. you'd need to wait for the CQEs, but it's a good question for me to think about later.
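A minimal liburing sketch of that recovery path, under the assumption that userspace waits for the notification CQEs before reusing the buffers; the socket fd and the pending-notification counter are placeholders, not anything defined in this thread:

```c
#include <liburing.h>
#include <sys/socket.h>

/* On a linked-timeout ETIME for a zerocopy send, force the socket to give
 * the buffers back in bounded time; IORING_CQE_F_NOTIF CQEs will follow. */
static void on_sendzc_timeout(int sockfd)
{
    shutdown(sockfd, SHUT_RDWR);
}

/* Before freeing/reusing the send buffers, drain the outstanding
 * notification CQEs (one per zerocopy send that actually transmitted data). */
static void drain_notifications(struct io_uring *ring, unsigned pending)
{
    struct io_uring_cqe *cqe;

    while (pending && io_uring_wait_cqe(ring, &cqe) == 0) {
        if (cqe->flags & IORING_CQE_F_NOTIF)
            pending--;  /* buffer tied to cqe->user_data is free again */
        io_uring_cqe_seen(ring, cqe);
    }
}
```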

@beldzhang
Author

beldzhang commented Aug 31, 2022

> That's great, I'd love to hear about your use case. TCP? What is your usual payload size per send (e.g. io_uring request)?
> Heads up: there is some ongoing discussion about the API, and the final version may still change (in the direction of simplification).

It is over TCP; the payload size is requested by the client side, from 512 B to 256 KB.
We have already started upgrading the code; sendzc gives a noticeable reduction in CPU usage on my server.

> close() is often not of much help, but let's say you call shutdown(SHUT_RDWR) ...

Will try to simulate this after the code upgrade.

Another issue found: because splice consumes more CPU and sendzc looks like it resolves this, I redesigned the code to use fixed buffers. The linked SQEs are (a liburing sketch follows the list):

1) read_fixed(data)
2) sendzc(head + data)
3) link_timeout
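A minimal liburing sketch of this chain (error handling omitted; the fds, buffer, registered-buffer index and timeout value are placeholders, and the header is folded into the same buffer for brevity):

```c
#include <liburing.h>

static void queue_read_then_sendzc(struct io_uring *ring, int filefd, int sockfd,
                                   void *buf, unsigned len, __u64 file_off,
                                   int buf_index, struct __kernel_timespec *ts)
{
    struct io_uring_sqe *sqe;

    /* 1) read from the image / raw device into a registered buffer */
    sqe = io_uring_get_sqe(ring);
    io_uring_prep_read_fixed(sqe, filefd, buf, len, file_off, buf_index);
    sqe->flags |= IOSQE_IO_LINK;

    /* 2) zerocopy send of the (header + data) buffer */
    sqe = io_uring_get_sqe(ring);
    io_uring_prep_send_zc(sqe, sockfd, buf, len, 0, 0);
    sqe->flags |= IOSQE_IO_LINK;

    /* 3) linked timeout bounding the send */
    sqe = io_uring_get_sqe(ring);
    io_uring_prep_link_timeout(sqe, ts, 0);

    io_uring_submit(ring);
}
```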

When testing with a solid image file and a raw hard drive, SQE 2) returned ECANCELED after about 1k~2k requests were processed.
I already checked that 1) returns the right length, and it happens with or without 3).
It happens even when there is only one client connected.
Found on kernels 5.15.63, 5.19.5, and 6.0-rc3.
EDIT: SQE 2) also has the problem when using send() instead of sendzc(), and it occurs with non-fixed buffers as well.

BTW, I know sendfile() is not welcomed, but compared with a thread-mode server, sendfile() has the lowest CPU usage; in user mode there is no need to care about a pipe, and it is useful for many kinds of use cases (e.g. an HTTP server).
So is there any possibility of adding this, or something like it?

@beldzhang
Author

kernel....: 5.19.5
liburing..: 2.3-git
Tested with READ and READ_FIXED; it's the same.
After changing IOSQE_IO_LINK to IOSQE_IO_HARDLINK, this does NOT happen.

Each ECANCELED causes the following two SQEs, SEND and TIMEOUT, to fail.

[DEVEL-10:~/.../kernel/dell]# perf trace -e io_uring:io_uring_fail_link --call-graph fp
     0.000 iou-0/24077 io_uring:io_uring_fail_link(ctx: 0xffff8883a103c800, req: 0xffff88810ccef600, user_data: 140370983686400, opcode: 22, link: 0xffff88810cceff00, op_str: "READ")
                                       io_disarm_next ([kernel.kallsyms])
                                       io_disarm_next ([kernel.kallsyms])
                                       io_free_batch_list ([kernel.kallsyms])
                                       __io_submit_flush_completions ([kernel.kallsyms])
                                       tctx_task_work ([kernel.kallsyms])
                                       task_work_run ([kernel.kallsyms])
                                       __do_sys_io_uring_enter ([kernel.kallsyms])
                                       do_syscall_64 ([kernel.kallsyms])
                                       entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
                                       DEBUGEND_$NZ_IORING (/root/work/devel/2/common/tools/benckmark/idrive6)
                                       [0] ([unknown])
                                       [0] ([unknown])
     0.008 iou-0/24077 io_uring:io_uring_fail_link(ctx: 0xffff8883a103c800, req: 0xffff88810ccef600, user_data: 140370983686400, opcode: 22, link: 0xffff88810ccee400, op_str: "READ")
                                       io_disarm_next ([kernel.kallsyms])
                                       io_disarm_next ([kernel.kallsyms])
                                       io_free_batch_list ([kernel.kallsyms])
                                       __io_submit_flush_completions ([kernel.kallsyms])
                                       tctx_task_work ([kernel.kallsyms])
                                       task_work_run ([kernel.kallsyms])
                                       __do_sys_io_uring_enter ([kernel.kallsyms])
                                       do_syscall_64 ([kernel.kallsyms])
                                       entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
                                       DEBUGEND_$NZ_IORING (/root/work/devel/2/common/tools/benckmark/idrive6)
                                       [0] ([unknown])
                                       [0] ([unknown])

@beldzhang
Author

I found that the notification slot was removed from both the kernel and liburing.
After checking up to 42ee2bd, there is no flag to check whether send_zc is supported by the currently running kernel.
Is there any plan to add such a flag, or should we just check for kernel version >= 6.0?

@axboe
Owner

axboe commented Sep 3, 2022

Never rely on a kernel version check. Either just try the command and see if it fails, or use io_uring_get_probe() and check whether that opcode is marked as supported.
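For reference, a small sketch of the probe approach with liburing (IORING_OP_SEND_ZC is the zerocopy send opcode; the helper name is made up here):

```c
#include <liburing.h>
#include <stdbool.h>

static bool send_zc_supported(void)
{
    struct io_uring_probe *probe = io_uring_get_probe();
    bool supported = probe && io_uring_opcode_supported(probe, IORING_OP_SEND_ZC);

    if (probe)
        io_uring_free_probe(probe);
    return supported;
}
```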

@isilence
Collaborator

isilence commented Sep 5, 2022

> I found that the notification slot was removed from both the kernel and liburing. After checking up to 42ee2bd, there is no flag to check whether send_zc is supported by the currently running kernel. Is there any plan to add such a flag, or should we just check for kernel version >= 6.0?

Agree with Jens on the testing side.

The API should be easier now. If it sends anything, io_uring posts 2 CQEs. The first one is the usual request completion with IORING_CQE_F_MORE set, and the second one is the "buffer-free" notification with the same cqe->user_data and IORING_CQE_F_NOTIF set. If it fails (i.e. cqe->res < 0), there will be only one CQE, without IORING_CQE_F_MORE. So you can look at IORING_CQE_F_MORE to figure out whether there will be one or two CQEs. I'll write a man page, I think this week.
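A sketch of how userspace might classify those CQEs; the buffer-pool handling hinted at in the comments is application-specific and not part of the API:

```c
#include <liburing.h>
#include <stdio.h>

static void handle_send_zc_cqe(struct io_uring_cqe *cqe)
{
    if (cqe->flags & IORING_CQE_F_NOTIF) {
        /* Second CQE: the kernel dropped its reference, the buffer keyed
         * by cqe->user_data may now be reused or freed. */
        return;
    }
    if (cqe->res < 0) {
        /* Failed send: F_MORE is not set, no notification will follow. */
        fprintf(stderr, "send_zc failed: %d\n", cqe->res);
        return;
    }
    /* First CQE: cqe->res is the number of bytes sent; if F_MORE is set,
     * a notification CQE with the same user_data arrives later. */
    printf("sent %d bytes, notification pending: %d\n",
           cqe->res, !!(cqe->flags & IORING_CQE_F_MORE));
}
```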

@isilence
Collaborator

isilence commented Sep 5, 2022

> That's great, I'd love to hear about your use case. TCP? What is your usual payload size per send (e.g. io_uring request)?
> Heads up: there is some ongoing discussion about the API, and the final version may still change (in the direction of simplification).

> It is over TCP; the payload size is requested by the client side, from 512 B to 256 KB. We have already started upgrading the code; sendzc gives a noticeable reduction in CPU usage on my server.

Nice!

> BTW, I know sendfile() is not welcomed, but compared with a thread-mode server, sendfile() has the lowest CPU usage; in user mode there is no need to care about a pipe, and it is useful for many kinds of use cases (e.g. an HTTP server). So is there any possibility of adding this, or something like it?

I don't have a strong opinion on that one; it could be implemented, but it would definitely require more changes than just adding a sendfile io_uring opcode.

@isilence
Collaborator

isilence commented Sep 5, 2022

> Another issue found: because splice consumes more CPU and sendzc looks like it resolves this, I redesigned the code to use fixed buffers. The linked SQEs are:

Fixed buffers are faster, but fwiw you can also use normal buffers with sendzc.

> 1) read_fixed(data)
> 2) sendzc(head + data)
> 3) link_timeout

Just a general thought on performance, which might not matter for your case but still: linked timeouts are not the fastest thing, so I'd experiment and see if there is a noticeable speedup from not having them. One idea for optimising it is to have a per-ring timer / timeout request (or probably a couple of them), and if/when they fire you issue a bunch of IORING_OP_ASYNC_CANCEL requests (a rough sketch follows).
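A rough sketch of that idea, assuming liburing; the timer tag and the policy for deciding which user_data values are stale are placeholders:

```c
#include <liburing.h>

#define TIMER_TAG ((__u64)-1)   /* hypothetical marker for the per-ring timer */

/* Arm one timeout request for the whole ring instead of a linked timeout
 * per send; re-arm it from the completion handler when it fires. */
static void arm_ring_timer(struct io_uring *ring, struct __kernel_timespec *ts)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

    io_uring_prep_timeout(sqe, ts, 0, 0);
    io_uring_sqe_set_data64(sqe, TIMER_TAG);
}

/* When the timer fires, cancel whatever requests the application deems stuck. */
static void cancel_stale(struct io_uring *ring, const __u64 *stale_tags, int nr)
{
    for (int i = 0; i < nr; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        io_uring_prep_cancel64(sqe, stale_tags[i], 0);
    }
}
```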

> When testing with a solid image file and a raw hard drive, SQE 2) returned ECANCELED after about 1k~2k requests were processed. I already checked that 1) returns the right length, and it happens with or without 3). It happens even when there is only one client connected. Found on kernels 5.15.63, 5.19.5, and 6.0-rc3. EDIT: SQE 2) also has the problem when using send() instead of sendzc(), and it occurs with non-fixed buffers as well.

I'm not entirely sure I get it. What do you mean by "happens with or without 3)"? Or maybe you have a simple reproducer?
It should always post 3 CQEs per such link unless you use IORING_FEAT_CQE_SKIP.

@beldzhang
Author

Changed to io_uring_get_probe() to check sendzc support; thanks to Jens.
Will test the new API with 6.0-rc4 this week, and will check whether there is any performance loss with the linked timeout.

--

For the abnormal ECANCELED, I uploaded the testing program.

> I'm not entirely sure I get it. What do you mean by "happens with or without 3)"? Or maybe you have a simple reproducer?

This happens when submitting 2 SQEs (read/send) or 3 SQEs (read/send/timeout).
Output of the testing server:

[DEVEL-10:~/.../uring/tests]# make s && ./server /dev/sda
gcc -g server.c -o server -l uring
 ++ image size = 250059350016
STARTING IO_URING....
 ++ client-accepted: 6
     !! unexpected ECANCELED at   1 / 12413 !!
        resp-data: err=-125

In server.c, set sending_timeout to 1 to include the timeout, 0 to exclude it. Tested on 5.19.5.

issue-643-v1.zip

@beldzhang
Author

> The first one is the usual request completion with IORING_CQE_F_MORE set

In this one, cqe.res carries the right number of bytes sent.

> and the second one is the "buffer-free" notification with the same cqe->user_data and IORING_CQE_F_NOTIF

In this one, cqe.res is 0, so userspace must keep the sent count from the first CQE to handle the protocol data.

Is this the designed behavior?

@isilence
Collaborator

isilence commented Sep 6, 2022

> > The first one is the usual request completion with IORING_CQE_F_MORE set
>
> In this one, cqe.res carries the right number of bytes sent.
>
> > and the second one is the "buffer-free" notification with the same cqe->user_data and IORING_CQE_F_NOTIF
>
> In this one, cqe.res is 0,
>
> Is this the designed behavior?

Yes, that's by design; it can be changed if there is a good reason for that.

> so userspace must keep the sent count from the first CQE to handle the protocol data.

Not sure why you'd need the result at that point. The user should not modify data in the buffer until it gets the notification, but it is still allowed to read it, including with syscalls like write / send. The usual scenario on getting a notification is userspace freeing / reusing the buffer, and I expect the user already knows everything it needs about the buffer, like its length.

@beldzhang
Author

beldzhang commented Sep 6, 2022

Oh... sorry for not describing it clearly.
What I wanted to describe is the protocol handling, within one request-response:

recv-protocol-head
if (cqe.res == head_length)
    send-head-and-data
    if (cqe.res == required_length)
        next_loop

After changing send-head-and-data to zerocopy, the cqe.res from the F_NOTIF CQE is zero and cannot be used to check the amount actually sent, so saving the cqe.res from the F_MORE CQE is required.

After more consideration, this looks like the best way to handle sendzc for now. +1.

@isilence
Collaborator

isilence commented Sep 8, 2022

> For the abnormal ECANCELED, I uploaded the testing program.
>
> This happens when submitting 2 SQEs (read/send) or 3 SQEs (read/send/timeout). Output of the testing server:
>
> [DEVEL-10:~/.../uring/tests]# make s && ./server /dev/sda
> gcc -g server.c -o server -l uring
>  ++ image size = 250059350016
> STARTING IO_URING....
>  ++ client-accepted: 6
>      !! unexpected ECANCELED at   1 / 12413 !!
>         resp-data: err=-125
>
> In server.c, set sending_timeout to 1 to include the timeout, 0 to exclude it. Tested on 5.19.5.
>
> issue-643-v1.zip

I tried it out, but it doesn't reproduce for me, and I don't immediately see anything wrong.

@isilence
Collaborator

isilence commented Sep 8, 2022

BTW, did you use O_DIRECT on the block device file when you were benchmarking zc send vs sendfile?

@beldzhang
Author

> I tried it out, but it doesn't reproduce for me, and I don't immediately see anything wrong.

My sda is a USB SSD, and I re-tested with an NVMe disk on 6.0-rc4; it's the same.
Oh, and all of this testing is on a raw disk WITHOUT any mount point on it.

My system is Slackware 15.0 with a customized kernel; might that be related?

> BTW, did you use O_DIRECT on the block device file when you were benchmarking zc send vs sendfile?

No, I never open any image or disk with O_DIRECT.

@isilence
Collaborator

isilence commented Sep 9, 2022

> > I tried it out, but it doesn't reproduce for me, and I don't immediately see anything wrong.
>
> My sda is a USB SSD, and I re-tested with an NVMe disk on 6.0-rc4; it's the same. Oh, and all of this testing is on a raw disk WITHOUT any mount point on it.
>
> My system is Slackware 15.0 with a customized kernel; might that be related?

Finally caught the bug and I see where it came from. I'd like to add your Reported-by tag to the fix commit, so it's in the history that you found the bug and helped the community, but I'd need your full name and email. It looks like:

Reported-by: Pavel Begunkov <my_email@example.com>

@beldzhang
Author

Very happy to help the community.
My name: Beld Zhang
E-mail: beldzhang@gmail.com

Thanks to all for the hard work!

@axboe
Owner

axboe commented Sep 9, 2022

Thanks, I've updated the commit and also included a link to this issue.

ammarfaizi2 pushed a commit to ammarfaizi2/linux-fork that referenced this issue Sep 9, 2022
We have a couple of problems, first reports of unexpected link breakage
for reads when cqe->res indicates that the IO was done in full. The
reason here is partial IO with retries.

TL;DR; we compare the result in __io_complete_rw_common() against
req->cqe.res, but req->cqe.res doesn't store the full length but rather
the length left to be done. So, when we pass the full corrected result
via kiocb_done() -> __io_complete_rw_common(), it fails.

The second problem is that we don't try to correct res in
io_complete_rw(), which, for instance, might be a problem for O_DIRECT
but when a prefix of data was cached in the page cache. We also
definitely don't want to pass a corrected result into io_rw_done().

The fix here is to leave __io_complete_rw_common() alone, always pass
not corrected result into it and fix it up as the last step just before
actually finishing the I/O.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: axboe/liburing#643
Reported-by: Beld Zhang <beldzhang@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
@beldzhang
Author

> Thanks, I've updated the commit and also included a link to this issue.

Patched 4d9cb92 onto 6.0-rc4; my testing is OK now.

Next I will check/test the error handling that this issue started with.

@isilence
Collaborator

@beldzhang, is everything going all right? It looks like you've deleted a message. Anyway, let me drop a couple of notes here:

  1. It'll be slower for small per-request payloads. So, for TCP, if you're sending 1MB per request, you have nothing to worry about, but at 4KB it might be slower. If the payload size is not fixed, I'd recommend mixing normal and zc sends depending on the payload size (see the sketch after this list). I'm not sure where the tipping point is, but you can try to find it experimentally.

  2. I have a branch with zc sendmsg and iovec support; I will be sending it for 6.1. Tell me if it would help you to increase the payload size.

  3. If you have problems, send me perf profiles, one for the setup with normal sends and one for zc sends, and I'll take a look.

  4. The reason I asked about O_DIRECT for reads is that without it there is an extra copy through the page cache, which is not great. Just FWIW.

  5. Do you utilise 100% of a CPU / core? You may want to pin the process to one core.
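For point 1, a tiny sketch of mixing the two send variants by payload size; the 16 KB threshold is an arbitrary placeholder to be tuned experimentally:

```c
#include <liburing.h>

#define ZC_THRESHOLD (16 * 1024)  /* hypothetical tipping point */

static void prep_send_auto(struct io_uring_sqe *sqe, int sockfd,
                           const void *buf, size_t len)
{
    if (len >= ZC_THRESHOLD)
        io_uring_prep_send_zc(sqe, sockfd, buf, len, 0, 0);  /* big payload */
    else
        io_uring_prep_send(sqe, sockfd, buf, len, 0);        /* small payload */
}
```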

@isilence
Collaborator

It might also be a latency problem. Don't forget that the second CQE will likely come with a good delay after the first one, and you absolutely don't want to wait for it before sending more I/O. In other words, don't strictly keep:

  1. submit send
  2. wait send complete
  3. wait for buffer complete
  4. goto 1)

but rather something more like (a sketch follows the list):

  1. submit send
  2. wait send complete
  3. goto 1
  4. (step 4.5, asynchronous) wait for buffer complete
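A sketch of that second model as a completion loop: new sends are driven off the send completion, and the F_NOTIF CQE only recycles buffers. The submit-next-send step is left as a comment since it is application-specific:

```c
#include <liburing.h>

static void completion_loop(struct io_uring *ring)
{
    struct io_uring_cqe *cqe;

    while (io_uring_wait_cqe(ring, &cqe) == 0) {
        if (cqe->flags & IORING_CQE_F_NOTIF) {
            /* Late buffer notification: return the buffer to the pool,
             * but never gate the next send on this event. */
        } else {
            /* Send completed (or failed): queue the next send right away,
             * e.g. prep another read + send_zc chain here. */
        }
        io_uring_cqe_seen(ring, cqe);
        io_uring_submit(ring);
    }
}
```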

@beldzhang
Author

> It might also be a latency problem...

Yes, it is the latency... I posted that message too hastily, before figuring out my hardware environment. I am using a USB Ethernet adapter that has a large TX timer; previous testing was not affected by it, but sendzc is.

> In other words, don't strictly keep: 1. submit send, 2. wait send complete, 3. wait for buffer complete, 4. goto 1)

Sadly, the current code is like this: step 3 spends a long time waiting and causes the speed drop. I will try to use the second model in the next development cycle.

> 1. It'll be slower for small per-request payloads...
> 2. Why I asked about O_DIRECT for reads...

Will test next week.

Already tested the error handling and it all behaves as expected. Looks like we could close this issue successfully soon.

Thanks a lot for the notes and advice.

@beldzhang
Author

beldzhang commented Sep 12, 2022

  1. Error handling passes.

  2. sendzc has lower CPU usage even when synchronously waiting for the buffer-complete notification.

  3. Asynchronous waiting for the buffer-complete notification is postponed, because it requires a buffer-management redesign.

  4. Another short recv was found in 5.15.67, causing a recv-write link to break; it is not seen on 5.19 and 6.0-rc.
    In ...prep_recv() the MSG_WAITALL flag is set, but a partial recv still happens, and then the following write SQE is canceled.
    Is it known? Any plan to fix it or not?

  5. In liburing.h, ...prep_sendzc_fixed was removed; is it possible to add it back?

@beldzhang
Author

beldzhang commented Sep 15, 2022

> Finally caught the bug and I see where it came from.

The unexpected ECANCELED happened on the testing server with 2 CPUs, doing 256 KB random reads linked with sends.
nvme0n1 has no mountpoint.

[nzel-server:~/w/devel/2/test/uring-inactive/tests]# uname -r
6.0.0-rc5-64-6
[nzel-server:~/w/devel/2/test/uring-inactive/tests]# ./server /dev/nvme0n1
 ++ image size = 2048408248320
STARTING IO_URING....
 ++ client-accepted: 6
     !! unexpected ECANCELED at   1 /    51 !!
        resp-data: err=-125

Testing program attached: issue-643-v2.zip
It also includes cli_write.c, for:

> 1. Another short recv was found in 5.15.67, causing a recv-write link to break; it is not seen on 5.19 and 6.0-rc.
[nzel-server:~/w/devel/2/test/uring-inactive/tests]# uname -r
5.15.64-64-6
[nzel-server:~/w/devel/2/test/uring-inactive/tests]# ./server /dev/nvme0n1
 ++ image size = 2048408248320
STARTING IO_URING....
 ++ client-accepted: 6
 !! partial recv data:  65450, require: 262144
 !! partial recv data: 163705, require: 196694
 !! partial recv data:  32741, require:  32989
 !! partial recv data: 229344, require: 262144
...

Update:
Patched and tested 62bb0647.
Hope this is useful.

[nzel-server:~/w/devel/2/test/uring-inactive/tests]# ./server /dev/nvme0n1
 ++ image size = 2048408248320
STARTING IO_URING....
 ++ client-accepted: 6
     !! unexpected ECANCELED at   1 / 16372 !!
        resp-data: err=-125
[nzel-server:~]# perf trace -e io_uring:io_uring_fail_link --call-graph fp
     0.000 server/1620 io_uring:io_uring_fail_link(ctx: 0xffff888111027800, req: 0xffff888104098e00, user_data: 3, opcode: 22, link: 0xffff888104072400, op_str: "READ")
                                       io_disarm_next ([kernel.kallsyms])
                                       io_disarm_next ([kernel.kallsyms])
                                       io_free_batch_list ([kernel.kallsyms])
                                       __io_submit_flush_completions ([kernel.kallsyms])
                                       ctx_flush_and_put ([kernel.kallsyms])              <-- this is different from the previous trace
                                       tctx_task_work ([kernel.kallsyms])
                                       task_work_run ([kernel.kallsyms])
                                       io_run_task_work_sig ([kernel.kallsyms])
                                       __do_sys_io_uring_enter ([kernel.kallsyms])
                                       do_syscall_64 ([kernel.kallsyms])
                                       entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
                                       syscall (/lib64/libc-2.33.so)
                                       [0] ([unknown])
                                       [0] ([unknown])
     0.018 server/1620 io_uring:io_uring_fail_link(ctx: 0xffff888111027800, req: 0xffff888104098e00, user_data: 3, opcode: 22, link: 0xffff888104098a00, op_str: "READ")
                                       io_disarm_next ([kernel.kallsyms])
                                       io_disarm_next ([kernel.kallsyms])
                                       io_free_batch_list ([kernel.kallsyms])
                                       __io_submit_flush_completions ([kernel.kallsyms])
                                       ctx_flush_and_put ([kernel.kallsyms])
                                       tctx_task_work ([kernel.kallsyms])
                                       task_work_run ([kernel.kallsyms])
                                       io_run_task_work_sig ([kernel.kallsyms])
                                       __do_sys_io_uring_enter ([kernel.kallsyms])
                                       do_syscall_64 ([kernel.kallsyms])
                                       entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
                                       syscall (/lib64/libc-2.33.so)
                                       [0] ([unknown])
                                       [0] ([unknown])

@beldzhang
Author

beldzhang commented Sep 18, 2022

Finally found it.
In io_read(), if there is no retry, currently only the last ret of io_iter_do_read() is passed to kiocb_done(), which then fails in __io_complete_rw_common() because ret != cqe.res.
It should be the total amount that all io_iter_do_read() calls returned, and io->bytes_done should only be increased when there is a retry.

I tried to fix it but could not get it done (I don't know how to deal with the retry path).

After adding some logging, I found the following:
a. I don't understand why cqe.res becomes smaller.
b. When res in __io_complete_rw_common() is shortened, this req is marked failed, and final_ret is not the correct value either.

[nzel-server:~/uring-test]# ./server /dev/nvme0n1
 ++ image size = 2048408248320
STARTING IO_URING....
 ++ client-accepted: 6
     !! unexpected ECANCELED at   1 /    16 !!
        resp-data: err=-125
[  239.661253] ++io_read:  cqe.res(262144)
[  239.700637] ++io_read:  cqe.res(262144)
[  239.700806] ++__io_complete_rw_common: res=262144, cqe.res=262144
[  239.701070] ++io_read:  cqe.res(262144)
[  239.701320] ++io_read:  cqe.res(262144)
[  239.701431] ++io_read:  cqe.res(  3072)
[  239.701433] ++__io_complete_rw_common: res=  3072, cqe.res=  3072
[  239.701586] ++io_read:  cqe.res(262144)
[  239.701848] ++io_read:  cqe.res(262144)
[  239.701952] ++__io_complete_rw_common: res=262144, cqe.res=262144
[  239.702135] ++io_read:  cqe.res(262144)
[  239.702323] ++io_read:  cqe.res(262144)
[  239.702436] ++io_read:  cqe.res(  1536)
[  239.702439] ++__io_complete_rw_common: res=  1536, cqe.res=  1536
[  239.702565] ++io_read:  cqe.res(262144)
[  239.702788] ++io_read:  cqe.res(262144)
[  239.702897] ++__io_complete_rw_common: res=262144, cqe.res=262144
[  239.703037] ++io_read:  cqe.res(262144)
[  239.703297] ++io_read:  cqe.res(262144)
[  239.703395] ++__io_complete_rw_common: res=262144, cqe.res=262144
[  239.703551] ++io_read:  cqe.res(262144)
[  239.703768] ++io_read:  cqe.res(262144)
[  239.703878] ++io_read:  cqe.res(  1024)
[  239.703880] ++__io_complete_rw_common: res=  1024, cqe.res=  1024
[  239.703993] ++io_read:  cqe.res(262144)
[  239.704192] ++io_read:  cqe.res(262144)
[  239.704312] ++io_read:  cqe.res(  2048)
[  239.704315] ++__io_complete_rw_common: res=  2048, cqe.res=  2048
[  239.704426] ++io_read:  cqe.res(262144)
[  239.704623] ++io_read:  cqe.res(262144)
[  239.704752] ++__io_complete_rw_common: res=262144, cqe.res=262144
[  239.704867] ++io_read:  cqe.res(262144)
[  239.705065] ++io_read:  cqe.res(262144)
[  239.705181] ++io_read:  cqe.res(  2560)
[  239.705183] ++__io_complete_rw_common: res=  2560, cqe.res=  2560
[  239.705291] ++io_read:  cqe.res(262144)
[  239.705476] ++io_read:  cqe.res(262144)
[  239.705591] ++io_read:  cqe.res(  1024)
[  239.705593] ++__io_complete_rw_common: res=  1024, cqe.res=  1024
[  239.705695] ++io_read:  cqe.res(262144)
[  239.705905] ++io_read:  cqe.res(262144)
[  239.706030] ++io_read:  cqe.res(  1536)
[  239.706033] ++__io_complete_rw_common: res=  1536, cqe.res=  1536
[  239.706109] ++io_read:  cqe.res(262144)
[  239.706290] ++io_read:  cqe.res(262144)
[  239.706391] ++io_read:  cqe.res(  1024)
[  239.706393] ++__io_complete_rw_common: res=  1024, cqe.res=  1024
[  239.706472] ++io_read:  cqe.res(262144)
[  239.706648] ++io_read:  cqe.res(262144)
[  239.706747] ++io_read:  cqe.res(  1536)
[  239.706749] ++__io_complete_rw_common: res=  1536, cqe.res=  1536
[  239.706829] ++io_read:  cqe.res(262144)
[  239.707013] ++io_read:  cqe.res(262144)
[  239.707111] ++io_read:  cqe.res(  1536)
[  239.707113] ++__io_complete_rw_common: res=  1536, cqe.res=  1536
[  239.707194] ++io_read:  cqe.res(262144)
[  239.707386] ++io_read:  cqe.res(262144)
[  239.707504] ++__io_complete_rw_common: res=  3072, cqe.res=262144
[  239.707505]  !! __io_complete_rw_common(): res(  3072) != cqe.res(262144)
[  239.707507]    ++ req_set_fail: op=22, flags=40200004
[  239.707508] CPU: 31 PID: 1457 Comm: server Not tainted 6.0.0-rc5-64-6b #21
[  239.707510] Hardware name: Dell Inc. Precision T5600/000000, BIOS A19 06/30/2019
[  239.707512] Call Trace:
[  239.707513]  <TASK>
[  239.707515]  dump_stack_lvl+0x34/0x44
[  239.707518]  __io_complete_rw_common+0xac/0xd0
[  239.707521]  kiocb_done.cold+0xc/0x3c
[  239.707524]  io_read.cold+0x10f/0x2cf
[  239.707526]  ? __kmalloc+0x64/0x4b0
[  239.707529]  ? update_load_avg+0x7e/0x740
[  239.707532]  ? newidle_balance+0x2ba/0x3f0
[  239.707534]  ? io_writev_prep_async+0x80/0x80
[  239.707537]  io_issue_sqe+0x60/0x410
[  239.707540]  ? _raw_spin_unlock+0x12/0x30
[  239.707542]  ? finish_task_switch.isra.0+0xa3/0x2b0
[  239.707545]  io_req_task_submit+0x2a/0x70
[  239.707548]  handle_tw_list+0x9a/0xd0
[  239.707551]  tctx_task_work+0x4e/0x120
[  239.707554]  task_work_run+0x59/0x90
[  239.707557]  io_run_task_work_sig+0x43/0x60
[  239.707559]  __do_sys_io_uring_enter+0x2a2/0x8a0
[  239.707561]  ? percpu_ref_get_many+0x30/0x30
[  239.707564]  do_syscall_64+0x3d/0x90
[  239.707566]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  239.707568] RIP: 0033:0x7f3662bb1c09
[  239.707571] Code: 48 8d 3d 6a 1f 0d 00 0f 05 eb a5 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 2f e2 0c 00 f7 d8 64 89 01 48
[  239.707573] RSP: 002b:00007fff43f32918 EFLAGS: 00000216 ORIG_RAX: 00000000000001aa
[  239.707575] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3662bb1c09
[  239.707577] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000005
[  239.707578] RBP: 0000000000404100 R08: 0000000000000000 R09: 0000000000000008
[  239.707579] R10: 0000000000000001 R11: 0000000000000216 R12: 00007fff43f32970
[  239.707580] R13: 00007fff43f329b8 R14: 0000000000000000 R15: 0000000000000000
[  239.707582]  </TASK>

@beldzhang
Author

OK, here is a rough, wild-west concept patch that fixes this abnormal ECANCELED, based on 6.0-rc6:

--- rw-orig.c	2022-09-18 16:44:14.000000000 -0400
+++ rw.c	2022-09-18 19:53:58.725235656 -0400
@@ -706,7 +706,7 @@
 	struct kiocb *kiocb = &rw->kiocb;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
 	struct io_async_rw *io;
-	ssize_t ret, ret2;
+	ssize_t ret, ret2, ret_all = 0;
 	loff_t *ppos;
 
 	if (!req_has_async_data(req)) {
@@ -763,6 +763,7 @@
 	}
 
 	ret = io_iter_do_read(rw, &s->iter);
+	if (ret > 0) ret_all += ret;
 
 	if (ret == -EAGAIN || (req->flags & REQ_F_REISSUE)) {
 		req->flags &= ~REQ_F_REISSUE;
@@ -814,7 +815,7 @@
 		iov_iter_advance(&s->iter, ret);
 		if (!iov_iter_count(&s->iter))
 			break;
-		io->bytes_done += ret;
+//		io->bytes_done += ret;
 		iov_iter_save_state(&s->iter, &s->iter_state);
 
 		/* if we can retry, do so with the callbacks armed */
@@ -830,8 +831,11 @@
 		 * here, and if we do, then just retry at the new offset.
 		 */
 		ret = io_iter_do_read(rw, &s->iter);
-		if (ret == -EIOCBQUEUED)
+		if (ret > 0) ret_all += ret;
+		if (ret == -EIOCBQUEUED) {
+			io->bytes_done += ret_all;
 			return IOU_ISSUE_SKIP_COMPLETE;
+		}
 		/* we got some bytes, but not all. retry. */
 		kiocb->ki_flags &= ~IOCB_WAITQ;
 		iov_iter_restore(&s->iter, &s->iter_state);
@@ -840,7 +844,7 @@
 	/* it's faster to check here then delegate to kfree */
 	if (iovec)
 		kfree(iovec);
-	return kiocb_done(req, ret, issue_flags);
+	return kiocb_done(req, ret_all, issue_flags);
 }
 
 int io_write(struct io_kiocb *req, unsigned int issue_flags)

@isilence
Collaborator

> 2. sendzc has lower CPU usage even when synchronously waiting for the buffer-complete notification.

Great

> 3. Asynchronous waiting for the buffer-complete notification is postponed, because it requires a buffer-management redesign.
>
> 4. Another short recv was found in 5.15.67, causing a recv-write link to break; it is not seen on 5.19 and 6.0-rc.
>    In `...prep_recv()` the `MSG_WAITALL` flag is set, but a partial recv still happens, and then the following write SQE is canceled.
>    Is it known? Any plan to fix it or not?

The short-read fix hasn't been ported to older kernels yet. Are you using a stable kernel?

> 5. In liburing.h, `...prep_sendzc_fixed` was removed; is it possible to add it back?

Yeah, I can add it back.

@beldzhang
Author

oh, there are two different issues:

> 1. Another short recv was found in 5.15.67, causing a recv-write link to break; it is not seen on 5.19 and 6.0-rc ...

This was found recently, and after comparing the sources between these versions, it looks difficult to backport due to the huge changes, so please ignore it.


> The short-read fix hasn't been ported to older kernels yet.

> The unexpected ECANCELED happened on the testing server with 2 CPUs, doing 256 KB random reads linked with sends.

The short read is not completely fixed yet; I posted some messages starting from here: #643 (comment)
Testing code is already included: issue-643-v2.zip

In my debugging, I found that bytes_done and req->cqe.res lose sync when there are retries ...

> Are you using a stable kernel?

Currently we are running 5.15.x and preparing to upgrade to 6.0 or 6.1.

@isilence
Collaborator

Ah, I see. As for backports, I do have them; I just haven't sent them out yet, as I was waiting for another patch they depend on to land upstream.

@isilence
Collaborator

isilence commented Sep 26, 2022

> Finally found it. In io_read(), if there is no retry, currently only the last ret of io_iter_do_read() is passed to kiocb_done(), which then fails in __io_complete_rw_common() because ret != cqe.res. It should be the total amount that all io_iter_do_read() calls returned, and io->bytes_done should only be increased when there is a retry.

cqe.res stores the number of bytes we're currently trying to do, so how about this?

diff --git a/io_uring/rw.c b/io_uring/rw.c
index 76ebcfebc9a6..c562203d7a67 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -823,6 +823,7 @@ int io_read(struct io_kiocb *req, unsigned int issue_flags)
 			return -EAGAIN;
 		}
 
+		req->cqe.res = iov_iter_count(&s->iter);
 		/*
 		 * Now retry read with the IOCB_WAITQ parts set in the iocb. If
 		 * we get -EIOCBQUEUED, then we'll get a notification when the

I also pushed it to a branch with an explanation, for convenience.

isilence/linux@9a70c23

edit: branch
https://github.com/isilence/linux.git rw-link-break

@beldzhang
Author

> cqe.res stores the number of bytes we're currently trying to do, so how about this?

Testing passed: ~60 GB read at a 256 KB block size, three times, patched onto 6.0-rc7.

BTW, since I never hit issues with io_write, I think that part is all good.

@isilence
Collaborator

Perfect, can I add your tested-by tag?

@beldzhang
Author

> Perfect, can I add your tested-by tag?

Yes, I'd be very glad.

@isilence
Collaborator

Let's close this issue; it has long outgrown its initial purpose. You can open a new one if you hit more problems. And I'll be quite curious to hear about the results you get out of it in the future.

ollie-etl pushed a commit to etlsystems/linux that referenced this issue Oct 3, 2022