Too many threads with IOSQE_ASYNC and bad performance without... #420
What kernel is this? Since you did profiling, care to share those profiles? I'm assuming it's the actual request type handler that ends up being expensive, for the submission side. But I'd need to go a bit deeper than that to figure out what the real problem is here. For IOSQE_ASYNC, it really depends on the request type. For bounded execution time, the number of threads is limited. For unbounded work (like network, poll, etc), there really isn't a limit. We could add something like that, but it might work better if you just define how many inflight you want, and then mark appropriately in the app. That said, it should be very feasible to add an io_uring_register() opcode that can get/set the max unbounded number of threads. If you want to test, I can surely do that.
I've tested with 5.12.13 and 5.13.9. The request types are TCP socket and disk file I/O, including splice using intermediate pipes (my implementation of sendfile). Splice and disk I/O always get offloaded to a (single, I think) worker thread. Network I/O's io_write and io_read are what take up CPU time from the main thread when I'm not using IOSQE_ASYNC. On the flamegraph I see io_uring_submit calls into io_read->tcp_recvmsg and io_write->tcp_sendmsg. I can send you a flamegraph svg privately if you want.
Having an option to limit the number of threads would be great! Would that be so that each worker thread has its own work queue, or will new requests run synchronously when all worker threads are occupied?
Something like this:
Please do send the flamegraph, but it also sounds like a case of "the networking side is expensive"... The way it works is that there's a queue per node for each ring, and if we queue work and there's no available worker, then we'll see if we need to create one. If we do, then a new one is created and it'll handle that new work. If we cannot create a new worker (like would happen if you set the count lower with the above patch), then the newly added work would not get executed until a previous worker of the same type has finished what it is doing.
I'll test the posted patch and send it out for review. Would be great if you could tell me your email; I like to put Reported-by etc. tags in the actual commit to help show who requested/reviewed what.
Tested the patch and posted it.
Thanks! I'll test it too today or over the weekend. By the way, is the limit set by IORING_REGISTER_IOWQ_MAX_UNBOUND global or per process?
The limit is per-ring.
Actually, to be more specific, it's per node per ring. So if you have a dual core system with 2 nodes, then the limit is really doubled if you keep both nodes loaded on that particular ring. Hope that makes sense...
Makes sense, thanks!
io-wq divides work into two categories:

1) Work that completes in a bounded time, like reading from a regular file or a block device. This type of work is limited based on the size of the SQ ring.

2) Work that may never complete, we call this unbounded work. The amount of workers here is just limited by RLIMIT_NPROC.

For various use cases, it's handy to have the kernel limit the maximum amount of pending unbounded workers. Provide a way to do that with a new IORING_REGISTER_IOWQ_MAX_UNBOUND operation. IORING_REGISTER_IOWQ_MAX_UNBOUND takes an integer and sets the max worker count to what is being passed in, and returns the old value (or an error). If 0 is passed in, it simply returns the current value. The value is capped at RLIMIT_NPROC. This actually isn't that important as it's more of a hint; if we're exceeding the value, our attempt to fork a new worker will fail. This happens naturally already if more than one node is in the system, as these values are per-node internally for io-wq.

Reported-by: Johannes Lundberg <johalun0@gmail.com>
Link: axboe/liburing#420
Signed-off-by: Jens Axboe <axboe@kernel.dk>
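For illustration, userspace usage of the interface this commit message describes might look like the sketch below. This is not code from the thread: the opcode value and the exact way the integer is passed to io_uring_register(2) are guesses based on the commit message alone (the opcode that ultimately landed upstream is the array-based IORING_REGISTER_IOWQ_MAX_WORKERS, shown further below).

```c
#include <liburing.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Placeholder: take the real value from the patched kernel's headers;
 * this define exists only so the sketch compiles. */
#ifndef IORING_REGISTER_IOWQ_MAX_UNBOUND
#define IORING_REGISTER_IOWQ_MAX_UNBOUND 19
#endif

int main(void)
{
	struct io_uring ring;
	unsigned int limit = 16;	/* new cap; 0 would only query */
	int prev;

	if (io_uring_queue_init(64, &ring, 0) < 0)
		return 1;

	/* Per the commit message: sets the max unbounded worker count and
	 * returns the previous value (or a negative error). */
	prev = syscall(__NR_io_uring_register, ring.ring_fd,
		       IORING_REGISTER_IOWQ_MAX_UNBOUND, &limit, 1);
	if (prev < 0)
		perror("io_uring_register");
	else
		printf("previous unbounded worker cap: %d\n", prev);

	io_uring_queue_exit(&ring);
	return 0;
}
```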
I applied the patch to 5.14-rc7 and built the kernel. The io_uring_register command gives me correct before and after values; however, my application still creates new threads for each submission, resulting in hundreds or thousands of threads.
Just so we're clear, this is only for the unbounded workers. The bounded workers are still capped by the SQ ring size; in detail, the cap is min(SQ entries, 4 * number of online CPUs).
I need to double-check, but we had a few worker-creation fixes recently; they should be in -rc7, though. Just for the sake of completeness, can you try and pull git://git.kernel.dk/linux-block for-5.15/io_uring into a v5.14-rc7 branch on your end and test with that? If it still fails to properly cap your situation, I'll take a closer look. The basic test case I wrote worked just fine, but there could be differences in what is being queued.
My silly test case:
which basically just queues 1024 reads that will go async. Since they are on a pipe, execution time is unbounded. When the reads have been queued, we check how many worker threads are running. Sample run:
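Neither the test source nor the sample output survived the formatting here. A rough reconstruction of the test as described (queue 1024 reads on a pipe that never gets written to, then count iou-wrk threads) could look like this; names, buffer sizes, and the thread-counting shortcut are guesses, not the original code:

```c
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NR_READS 1024

int main(void)
{
	static char bufs[NR_READS][32];
	struct io_uring ring;
	char cmd[80];
	int fds[2], i;

	if (pipe(fds) < 0 || io_uring_queue_init(NR_READS, &ring, 0) < 0)
		return 1;

	for (i = 0; i < NR_READS; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		if (!sqe)
			break;
		io_uring_prep_read(sqe, fds[0], bufs[i], sizeof(bufs[i]), 0);
		/* Force io-wq offload; per the discussion below this isn't
		 * strictly needed for a pipe, only for sockets. */
		sqe->flags |= IOSQE_ASYNC;
	}
	io_uring_submit(&ring);
	sleep(1);	/* give io-wq a moment to spawn workers */

	/* Nothing is ever written to the pipe, so every read blocks in a
	 * worker; count how many the kernel actually created. */
	snprintf(cmd, sizeof(cmd), "ps -T -p %d | grep -c iou-wrk",
		 (int)getpid());
	system(cmd);

	io_uring_queue_exit(&ring);
	return 0;
}
```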
If this is a bounded issue as well, we can modify the patch to allow setting both of them. Here's an updated version you can apply to 5.14-rc7 and test. Note that this one requires you to pass in an int max_workers[2] array instead, where index 0 is the bounded count, and 1 is the unbounded count. The return value is now always 0/-1, and the previous count is copied into the array on success instead.
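For reference, the interface that eventually shipped matches this description, and liburing (2.1 and later) wraps it as io_uring_register_iowq_max_workers(). A minimal sketch of capping both pools, assuming that helper is available:

```c
#include <liburing.h>
#include <stdio.h>

int main(void)
{
	struct io_uring ring;
	unsigned int limits[2];
	int ret;

	if (io_uring_queue_init(64, &ring, 0) < 0)
		return 1;

	limits[0] = 8;	/* index 0: max bounded workers (file/block I/O) */
	limits[1] = 8;	/* index 1: max unbounded workers (sockets, pipes) */
	ret = io_uring_register_iowq_max_workers(&ring, limits);
	if (ret < 0) {
		fprintf(stderr, "register_iowq_max_workers: %d\n", ret);
		return 1;
	}
	/* On success the previous limits are copied back into the array. */
	printf("previous: bounded=%u unbounded=%u\n", limits[0], limits[1]);

	io_uring_queue_exit(&ring);
	return 0;
}
```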
Your test case works correctly for me as well. But when I set the IOSQE_ASYNC flag only for reads from a TCP socket, several hundred threads are still created. I will try the updated patch. IIUC, network reads should be unbounded, correct?
By the way, this is a 24 CPU system, so bounded should be limited to 96 threads IIUC (4 * 24, since my queue depth of 16K is far larger). By counting the threads during heavy load, this seems to be correct.
Changed the test case to use a TCP socket, even though it really shouldn't make a difference. And it does the right thing there for me too. If I don't set IOSQE_ASYNC, then they are all queued in polled mode, which doesn't create any workers (that's expected). If I set IOSQE_ASYNC, then it limits the workers as instructed:
Puzzled...
I merged the 5.15 io_uring branch into a clean 5.14-rc7 but still the same. Even limiting both bounded and unbounded didn't change anything. I only have IOSQE_ASYNC on for TCP socket reads, but still.. :( By the way, I'm using io_uring_prep_XXX for all reads/writes etc.
I limited bounded and unbounded to 2 and am still seeing over 60 iou-wrk threads without using IOSQE_ASYNC at all.
The weird thing is, your test works. Threads never exceed the specified max value.
I saw in your test you never set IOSQE_ASYNC.
IOSQE_ASYNC isn't needed for the pipe test, just for the socket one. But yeah, either one works for me, pipes or sockets. And they really should, since there's no difference at that level. Hence I am a bit puzzled on what is going on at your end. Are you sure you're setting it on the right ring, if you have more than one? Maybe describe your setup a bit, it might help me understand why it behaves differently.
Let me generate a small incremental patch that just adds some tracing to this.
It's just a single ring, single thread. A couple of listeners accepting incoming TCP connections. I got
I was playing around with
Try this one:
And then after booting, and before running your test, do: echo 1 > /sys/kernel/debug/tracing/events/io_uring/io_uring_wq/enable. Then run the test that shows too many workers, and attach a compressed copy of /sys/kernel/debug/tracing/trace in here.
Not sure what happened to the formatting there... Anyway, if you run the above trace addition, hopefully it'll shed some light on why it doesn't work.
Patch that actually links...:
|
At least that's something! Can you add this one on top as well? Curious if this is hashing related or not.
No change
Ok thanks for testing, that's a helpful datapoint.
Is this compared to not using IOSQE_ASYNC? Can you run a perf profile of the run and send it my way?
Yes, everything is the same except enabling/disabling IOSQE_ASYNC for TCP socket reads only. Emailed the perf data to you.
Can you give the current for-5.15/io_uring branch a go? I looked at your profile, and nothing really looks too busy. When you run with IOSQE_ASYNC, how many workers are active? Maybe try and do a ps -T on the process while the test is running.

There's another potential solution here. I'm not a particularly big fan of IOSQE_ASYNC. It's fine for some cases, but it tends to be a bit inefficient in various ways (for pollable file types). Hence I've been thinking about a hybrid kind of model. Right now, these are the two outcomes of queueing a request against a socket:

1) Without IOSQE_ASYNC, we try the request inline; if the socket isn't ready, we arm poll and retry when it triggers, so no worker is tied up.

2) With IOSQE_ASYNC, the request is punted directly to an iou-wrk worker, which blocks until it can complete.
The hybrid model would combine the two. Let's call it IOSQE_ASYNC_HYBRID. If that is set, then we still send it directly to an iou-wrk worker. However, that worker will go through the same motions as the non-IOSQE_ASYNC path, and arm poll if we cannot read/write the socket. This leaves the worker free to perform other requests, as the async trigger will ensure it runs the work immediately when we get the notification. An alternative approach is having IOSQE_ASYNC_HYBRID go through the normal submit attempt first, in the hope that data/space is available. If not, then it sends to iou-wrk and it arms poll. Slight twist on the above. This would work for unbounded work that can get interrupted; it would not work for bounded work. But bounded work isn't pollable anyway, so the existing IOSQE_ASYNC is the right model for that.
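As a purely illustrative userspace analogy of the proposed hybrid flow (this is not kernel code, and IOSQE_ASYNC_HYBRID is only a proposal at this point): try the operation non-blocking first, and only wait in poll when it returns EAGAIN, retrying once the poll triggers.

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Userspace sketch of the "try, arm poll, retry" idea: the caller
 * (standing in for an iou-wrk worker) never blocks inside read()
 * itself, only in poll(), so in the kernel model the worker would be
 * free to service other requests while waiting for the trigger. */
static ssize_t hybrid_read(int fd, void *buf, size_t len)
{
	for (;;) {
		ssize_t ret = read(fd, buf, len);	/* fd is O_NONBLOCK */

		if (ret >= 0 || errno != EAGAIN)
			return ret;

		struct pollfd pfd = { .fd = fd, .events = POLLIN };

		if (poll(&pfd, 1, -1) < 0)	/* "arm poll" and wait */
			return -1;
		/* POLLIN fired: retry the read, as the worker would. */
	}
}
```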
I will give it a try. Yeah, that's what I'm seeing too. Nothing is really busy on the system, but requests per second are just way down. Some connections even time out (against a 10s connection time limit). I've been running with a worker limit of 24 bounded and 24 unbounded for all tests recently. I run ps -T -p pid while the test is running and I see the number of threads go from 3 (at the start of the program) to 51 consistently when the system gets busy (with or without IOSQE_ASYNC it spawns 48 workers).
Back to the worst case again :( Correction regarding the last comment: with IOSQE_ASYNC, workers do come and go during the test. Without it, the number of workers is consistently maxed out. I didn't really see anything of interest by limiting perf to a worker pid.
What SHA did you run? I noticed a stupid last-minute error made it in there... This should be the right HEAD:
That's the one. Maybe it's more or less like the other day. It kind of depends on how much is cached, and for cached objects, sendfile (splice) is being used. I cleared the cache and ran a few times again and got similar results as yesterday.
Hmm, found something interesting! So far I had only enabled IOSQE_ASYNC for TCP socket reads. This time I tried enabling it only for TCP socket writes. Now performance is great, with no issues like with async reads. I can also see in my flamegraph that io_write (for TCP) is no longer called from the main thread's call to io_uring_submit but instead from the worker threads.
Ok, for forced async writes it looks like it is still waiting a little bit in io_uring_submit when it really shouldn't (there's lots of work to do, but it doesn't do it, so requests/sec drop). Still way better and more stable than when forcing async for reads.
When forcing async for connect, I'm getting connect timeouts in my application, even at very low load from the client, though never for individual requests.
I tried with SQPOLL and it seems the worker limit does not apply in that case; I'm getting 1000+ threads again. Edit: without using IOSQE_ASYNC at all, I'm still getting 250+ threads during load with SQPOLL.
I'll get to the rest tomorrow, but yeah, we do need to handle it a bit differently for SQPOLL; it currently doesn't apply the limits in the right task context. I'll queue up a patch for that and let you know, so you can test it. Thanks!
Can you try with this applied? Totally untested...
SQPOLL has a different thread doing submissions, we need to check for that and use the right task context when updating the worker values. Just hold the sqd->lock across the operation, this ensures that the thread cannot go away while we poke at ->io_uring.

Link: axboe/liburing#420
Reported-by: Johannes Lundberg <johalun0@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
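From the application side nothing changes with this fix; the same registration should simply take effect on an SQPOLL ring as well. A minimal sketch, with illustrative limits and idle time, assuming the liburing helper:

```c
#include <liburing.h>
#include <stdio.h>

int main(void)
{
	struct io_uring_params p = { 0 };
	struct io_uring ring;
	unsigned int limits[2] = { 8, 8 };	/* bounded, unbounded */
	int ret;

	p.flags = IORING_SETUP_SQPOLL;
	p.sq_thread_idle = 2000;	/* ms before the SQ thread naps */

	ret = io_uring_queue_init_params(64, &ring, &p);
	if (ret < 0) {
		fprintf(stderr, "queue_init: %d\n", ret);
		return 1;
	}

	/* With the patch above, this is applied in the SQPOLL thread's
	 * task context instead of the submitting task's. */
	ret = io_uring_register_iowq_max_workers(&ring, limits);
	if (ret < 0)
		fprintf(stderr, "max_workers: %d\n", ret);

	io_uring_queue_exit(&ring);
	return 0;
}
```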
Using SQPOLL now respects worker limit 👍🏻
Perfect, thanks for testing!
SQPOLL has a different thread doing submissions, we need to check for that and use the right task context when updating the worker values. Just hold the sqd->lock across the operation, this ensures that the thread cannot go away while we poke at ->io_uring.

Link: axboe/liburing#420
Fixes: 2e48005 ("io-wq: provide a way to limit max number of workers")
Reported-by: Johannes Lundberg <johalun0@gmail.com>
Tested-by: Johannes Lundberg <johalun0@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hey! Have you had time to work on this any more (the async read/write slowness issue), or on the hybrid model you mentioned earlier?
I haven't yet had time to dive into that. I think the model is working much better with the most recent patches, though. I'm still interested in exploring the hybrid model, but I need to spend some time on that. We should probably close out this issue; I don't like having multiple or never-ending issues when things get fixed and new things crop up. Would be great if you could open a new one for what comes next.
First, I don't want to use SQPOLL since that would have a CPU spinning at 100% all the time (our application is never idle, even at low util).
So I was having poor performance, and profiling the app I see io_uring_submit spends a lot of time in the kernel, like 30% of the main thread's time. Then I found IOSQE_ASYNC, which sounded great until I realized it spawns a thread for each request, meaning 10000+ threads get created on the system. I was expecting IOSQE_ASYNC to offload requests to a fixed number of worker threads. Is there not such an option? If not, what are the alternatives for reducing the main thread's time spent in the kernel?
Thanks!