send-zerocopy and linked timeout: how to deal with the buffer on timeout? #643
That's great, I'd love to hear about your use case. TCP? What is your usual payload size per send (e.g. per io_uring request)? Heads up: there is some discussion going on about the API, and the final version may still change (in the direction of simplification).
Close() is often not of much help, but let's say you call …
It is over TCP; the payload size is requested by the client side, from 512 B to 256 KB.
Will try to simulate this after the code is upgraded. Another issue found: when testing with a solid image file and a raw hard drive, sqe 2) returned ECANCELED after it processed about 1k~2k requests. BTW, I know …
Found the notification slot was removed from both the kernel and liburing.
Never rely on a kernel version check. Either just try the command and see if it fails, or use io_uring_get_probe() and check if that opcode is marked as supported.
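A minimal sketch of the probe approach, assuming a liburing/kernel pairing recent enough to define IORING_OP_SEND_ZC (the opcode name as it settled in 6.0):

```c
#include <liburing.h>

/* Returns non-zero if the kernel reports IORING_OP_SEND_ZC as supported;
 * returns 0 (the caller can still fall back to just trying the op) when
 * the probe itself is unavailable. */
static int sendzc_supported(void)
{
	struct io_uring_probe *probe = io_uring_get_probe();
	int ok;

	if (!probe)
		return 0;
	ok = io_uring_opcode_supported(probe, IORING_OP_SEND_ZC);
	io_uring_free_probe(probe);
	return ok;
}
```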
Agree with Jens on the testing side. The API should be easier now. If it sends anything, io_uring returns 2 CQEs. The first one is the usual request completion with IORING_CQE_F_MORE set, and the second one is the "buffer-free notification" with the same cqe->user_data and IORING_CQE_F_NOTIF. If it fails (i.e. cqe->res < 0), there will be only one CQE with no IORING_CQE_F_MORE. So, you can look for IORING_CQE_F_MORE to figure out whether there will be one or two CQEs. I'll write a man page, I think this week.
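Sketched out, that dispatch could look like the following; handle_send_result(), handle_send_error(), and release_buffer() are hypothetical application helpers:

```c
struct io_uring_cqe *cqe;

if (io_uring_wait_cqe(&ring, &cqe) == 0) {
	void *tag = io_uring_cqe_get_data(cqe);

	if (cqe->flags & IORING_CQE_F_NOTIF) {
		/* second CQE: the kernel is done with the buffer */
		release_buffer(tag);
	} else if (cqe->flags & IORING_CQE_F_MORE) {
		/* first CQE: cqe->res is the send result, a NOTIF follows */
		handle_send_result(tag, cqe->res);
	} else {
		/* only CQE: the send failed, no notification will come */
		handle_send_error(tag, cqe->res);
	}
	io_uring_cqe_seen(&ring, cqe);
}
```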
Nice!
I don't have a strong opinion on that one; it could be implemented, but it would definitely require more changes than just adding a sendfile io_uring opcode.
Fixed buffers are faster, but fwiw you can also use normal buffers with sendzc.
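For reference, a sketch of both variants using current liburing helpers (io_uring_prep_send_zc_fixed landed in liburing 2.3, possibly after this exchange); sock, buf, len, and the registered bufs[] are placeholders:

```c
struct io_uring_sqe *sqe;

/* normal (unregistered) buffer */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_send_zc(sqe, sock, buf, len, 0, 0);

/* fixed buffer: assumes bufs[] was registered beforehand via
 * io_uring_register_buffers(&ring, bufs, nbufs) */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_send_zc_fixed(sqe, sock, bufs[0].iov_base,
			    bufs[0].iov_len, 0, 0, /*buf_index=*/0);
```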
Just a general thought on performance, which might not matter for your case but still: linked timeouts are not the fastest thing, so I'd experiment and see if there is a noticeable speedup from not having them. One idea for optimising it is to have a per-ring timer / timeout request (or probably a couple of them), and if/when they fire, issue a bunch of IORING_OP_ASYNC_CANCEL requests; see the sketch below.
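Roughly like this, assuming some application-side bookkeeping (pending[], n_pending, is_expired(), and TIMER_TAG are all hypothetical):

```c
/* one coarse per-ring timeout instead of a linked timeout per send */
static void arm_ring_timer(struct io_uring *ring, struct __kernel_timespec *ts)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	io_uring_prep_timeout(sqe, ts, 0, 0);
	io_uring_sqe_set_data64(sqe, TIMER_TAG);
}

/* when the timer CQE fires, batch-cancel whatever is past its deadline */
static void cancel_expired(struct io_uring *ring)
{
	for (unsigned i = 0; i < n_pending; i++) {
		struct io_uring_sqe *sqe;

		if (!is_expired(&pending[i]))
			continue;
		sqe = io_uring_get_sqe(ring);
		io_uring_prep_cancel64(sqe, pending[i].user_data, 0);
	}
}
```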
Not entirely sure I get it; what do you mean by "happened with or without 3)"? Or maybe you have a simple reproducer?
Changed to -- for the abnormal case.
This happened when submitting 2 SQEs (read/send) or 3 SQEs (read/send/timeout).
In server.c: in the first CQE, cqe->res carries the right number of bytes sent; in the second one, cqe->res is 0, so user space must keep the sent count from the first CQE to handle protocol data. Is this the designed behavior?
Yes, that's by design; it can be changed if there is a good reason for that.
Not sure why you'd need the result at that point. The user should not modify data in the buffer until it gets the notification, but it's still allowed to read it, including with syscalls like write/send. The usual scenario on getting a notification is the userspace freeing or reusing the buffer, and I expect the user knows everything it needs about the buffer, like its length.
Oh... sorry for not describing it clearly.
After the change, and after more consideration, this is the best way to handle sendzc so far. +1.
Tried it out, but it doesn't reproduce for me, and I don't immediately see anything wrong.
BTW, did you use O_DIRECT on the block device file when you were benchmarking zc send vs sendfile?
My sda is a USB SSD disk, and I re-tested with an NVMe disk on 6.0-rc4; it's the same. My system is Slackware 15.0 with a customized kernel; might that be related?
No, I never open any image or disk with O_DIRECT.
Finally caught the bug and saw where it came from. I'd like to add your Reported-by tag in the fix commit so it's in the history that you found the bug and helped the community, but I'd need your full name and email. Looks like:
Very happy to help the community; thanks to all for the hard work!
Thanks, I've updated the commit and also included a link to this issue. |
We have a couple of problems. First, reports of unexpected link breakage for reads when cqe->res indicates that the IO was done in full. The reason here is partial IO with retries.

TL;DR: we compare the result in __io_complete_rw_common() against req->cqe.res, but req->cqe.res doesn't store the full length but rather the length left to be done. So, when we pass the full corrected result via kiocb_done() -> __io_complete_rw_common(), it fails.

The second problem is that we don't try to correct res in io_complete_rw(), which, for instance, might be a problem for O_DIRECT when a prefix of the data was cached in the page cache. We also definitely don't want to pass a corrected result into io_rw_done().

The fix here is to leave __io_complete_rw_common() alone, always pass the uncorrected result into it, and fix it up as the last step just before actually finishing the I/O.

Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: axboe/liburing#643
Reported-by: Beld Zhang <beldzhang@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Patched 4d9cb92 onto 6.0-rc4; my testing is OK now. Next I will check/test the error handling this issue began with.
@beldzhang, does it go right? Looks like you've deleted a message. Anyway, let me drop a couple of notes here:
It might also be a latency problem. Don't forget that the second CQE will likely come with a good delay after the first one, and you absolutely don't want to wait for it before sending more I/O. In other words, don't strictly keep: send → wait for the completion CQE → wait for the notification CQE → next send, but rather more like: keep sends queued and reap notification CQEs whenever they arrive (see the sketch below).
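A hedged sketch of that pipelined model; queue_sendzc() and reuse_buffer() are hypothetical helpers:

```c
int inflight = nbufs;	/* one tracked send per buffer */

/* queue all sends up front instead of waiting per buffer */
for (int i = 0; i < nbufs; i++)
	queue_sendzc(&ring, bufs[i]);
io_uring_submit(&ring);

/* reap completions and notifications in whatever order they arrive */
while (inflight > 0) {
	struct io_uring_cqe *cqe;

	io_uring_wait_cqe(&ring, &cqe);
	if (cqe->flags & IORING_CQE_F_NOTIF) {
		/* final CQE of a successful send: buffer is free again */
		reuse_buffer(io_uring_cqe_get_data(cqe));
		inflight--;
	} else if (!(cqe->flags & IORING_CQE_F_MORE)) {
		/* failed send: no NOTIF will follow */
		inflight--;
	}
	/* a CQE with F_MORE set carries the send result; keep waiting */
	io_uring_cqe_seen(&ring, cqe);
}
```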
Yes, it is the latency... I posted that message too hastily, before finding out it was down to my hardware environment. I'm using a USB-Ethernet adapter that has a large tx timer; previous testing was not affected, but sendzc is.
Sadly the current code is like the first model: 3) spends a long time waiting and causes the speed drop. Will try the second model in the next devel cycle.
Will test next week. Already tested the error handling and it is all as expected. Looks like we can close this issue successfully soon. Thanks a lot for the notes and advice.
The unexpected … Testing program attached: issue-643-v2.zip
update:
Finally found it. Tried to fix it but could not get it done (don't know how to deal with retry). Added some logging and found the following info:
OK, a wild-wild-west concept patch fixed this abnormal behavior.
Great
The short-read fix hasn't been ported to older kernels. Are you using some stable kernel?
Yeah, I can add it back.
Oh, there are two different issues:
This one was found recently, and after comparing the source between these versions, it looks difficult to backport due to huge changes, so please ignore it.
The short read is not completely fixed yet; I posted some messages starting from here: #643 (comment). In my debugging, found the …
Currently we are running 5.15.x and preparing to upgrade to 6.0 or 6.1.
Ah, I see. For the backports, I do have them; I just didn't send them out, as I was waiting for another patch they need to land upstream.
cqe.res stores the number of bytes we're currently trying to do, so how about this?

```diff
diff --git a/io_uring/rw.c b/io_uring/rw.c
index 76ebcfebc9a6..c562203d7a67 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -823,6 +823,7 @@ int io_read(struct io_kiocb *req, unsigned int issue_flags)
 		return -EAGAIN;
 	}
 
+	req->cqe.res = iov_iter_count(&s->iter);
 	/*
 	 * Now retry read with the IOCB_WAITQ parts set in the iocb. If
 	 * we get -EIOCBQUEUED, then we'll get a notification when the
```

I also pushed it to a branch with an explanation, for convenience. Edit: branch
Testing passes: ~60 GB read at a 256 KB block size, 3 times, patched onto 6.0-rc7. BTW, since I never met issues on …
Perfect, can I add your Tested-by tag?
Yes, I'd be very glad of that.
Let's close this issue; it has long outgrown its initial purpose. You can open a new one if you hit any problems. And I'll be quite curious to hear about any results you get out of it in the future.
Preparing to upgrade our diskless server to use send-zerocopy. A send SQE is linked with a timeout SQE to detect network problems; the linked timeout will return ETIME when something happens. At this point, how should the buffer passed in by …sendzc() be dealt with? Can it just be released as in a regular socket program, or is it still locked (maybe?) somewhere in the kernel stack?
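For concreteness, a minimal sketch of the setup in question, assuming current liburing names (io_uring_prep_send_zc, io_uring_prep_link_timeout); sock, buf, and len are placeholders:

```c
struct __kernel_timespec ts = { .tv_sec = 5, .tv_nsec = 0 };
struct io_uring_sqe *sqe, *tsqe;

sqe = io_uring_get_sqe(&ring);
io_uring_prep_send_zc(sqe, sock, buf, len, 0, 0);
sqe->flags |= IOSQE_IO_LINK;	/* arm the timeout against this send */
io_uring_sqe_set_data(sqe, buf);

tsqe = io_uring_get_sqe(&ring);
io_uring_prep_link_timeout(tsqe, &ts, 0);
io_uring_sqe_set_data(tsqe, NULL);

io_uring_submit(&ring);

/* Per the discussion above: even when the send fails with ETIME or
 * ECANCELED, only reuse the buffer once no IORING_CQE_F_NOTIF CQE is
 * outstanding, i.e. after the NOTIF arrives, or immediately if the
 * send CQE came back without IORING_CQE_F_MORE. */
```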