Soft deadlock with IORING_SETUP_IOPOLL and IORING_SETUP_ATTACH_WQ #367
Actually, 1 PROVIDE_BUFFERS sqe per fd is inaccurate. It could be more. Read buffers are stored in a custom OpenSSL BIO object and they can be batch-released. The power of documenting a problem is that it puts your invalid assumptions right in your face... I'll keep you updated.
I have some new info... The service that I am connecting to has 2 platforms: beta and prod. I always connect to beta because I found out that it provides a multi-msec latency edge over prod... The downside is that the service is sometimes buggy, as they test the next release on it. I just flipped the service I connect to from beta to prod and I haven't experienced the deadlock issue for the last 2-3 hours. This doesn't automatically mean that my issue comes from the service and that the client side is flawless. Until proven otherwise, I think that the client side has an issue that gets triggered by a different sequence of events found on beta. The 3-fd io_uring instance is always connected to prod anyway. If a full CQ leads to a deadlock, that would mean 1 full CQ could stall all the other attached io_uring instances... I am going to validate this by looking at the code...
I think that I have found the problem and I am about to have a solution for it, but in the meantime, here are a few observations that I have made during my investigation. In io_sq_thread():

The first thing that get_signal() does is call task_work_run(), so if all the validations that io_run_task_work() does are important, they get bypassed. Also, I was wondering in which context io_run_task_work() could be called where the current task state is not TASK_RUNNING... If this code is considered to be in the hot path, there might be some optimizations possible by avoiding calling task_work_run() twice. I was surprised to discover that pending task_works are not considered an activity that pushes back the idle timeout. Has this been considered? At least a simple workaround for a user wanting more spinning is to increase the sq_thread_idle value.
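As a concrete illustration of that workaround, here is a minimal sketch of requesting a longer SQPOLL idle at ring-setup time. The struct below is a trimmed local mirror of `struct io_uring_params` (so the snippet stands alone), and the flag value is reproduced from `<linux/io_uring.h>`; in real code you would fill the genuine struct and pass it to `io_uring_setup()` or liburing's `io_uring_queue_init_params()`:

```c
#include <assert.h>

/* Value reproduced from <linux/io_uring.h> for illustration. */
#define IORING_SETUP_SQPOLL (1U << 1)

/* Trimmed local mirror of struct io_uring_params (illustration only). */
struct sq_idle_params {
    unsigned flags;
    unsigned sq_thread_idle; /* ms before the SQ thread goes idle */
};

/* Build setup parameters that keep the SQ thread spinning longer
 * between submissions, e.g. idle_ms = 50 instead of the 5 used in
 * this issue. */
static struct sq_idle_params make_sqpoll_params(unsigned idle_ms)
{
    struct sq_idle_params p = {0};

    p.flags = IORING_SETUP_SQPOLL;
    p.sq_thread_idle = idle_ms;
    return p;
}
```

The trade-off is straightforward: a larger `sq_thread_idle` burns more CPU in the SQ thread but widens the window during which submissions are picked up without a wakeup.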
completions should wake the SQPOLL task, see
That chunk with task_works and SQPOLL idle-deadline pushing -- may be good.
I don't remember all that off the bat, but one example is io_cqring_wait_schedule(), see where it's called and how TASK_INTERRUPTIBLE is set just before it.
They're not important in your case. PF_EXITING can only be set when the task is dying, and so not running.
To test out your hypothesis, you can print sq_ring->flags from userspace to see if IORING_SQ_CQ_OVERFLOW is set:

```c
struct io_uring *ring = ...;
unsigned flags = IO_URING_READ_ONCE(*ring->sq.kflags);
```
IORING_SETUP_ATTACH_WQ? If so, then that's not how it's supposed to work. All overflowed CQEs will be stored in an internal list, and userspace will see IORING_SQ_CQ_OVERFLOW set. See
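A userspace check for that overflow condition might look like the sketch below. The flag value is reproduced from `<linux/io_uring.h>` so the snippet stands alone, and `cq_has_overflowed()` is a hypothetical helper, not a liburing API; with liburing you would feed it the flags read via `IO_URING_READ_ONCE(*ring->sq.kflags)`:

```c
#include <assert.h>
#include <stdbool.h>

/* Value reproduced from <linux/io_uring.h> for illustration. */
#define IORING_SQ_CQ_OVERFLOW (1U << 1)

/* True if the kernel has stashed overflowed CQEs on its internal
 * list. A later io_uring_enter() call with IORING_ENTER_GETEVENTS
 * gives the kernel a chance to flush them back into the CQ ring as
 * space frees up. Hypothetical helper, not part of liburing. */
static bool cq_has_overflowed(unsigned sq_flags)
{
    return (sq_flags & IORING_SQ_CQ_OVERFLOW) != 0;
}
```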
Pavel, you are correct: signal_pending() is quite rare. I was still in the mindset that adding a task_work was setting the infamous TIF_NOTIFY_SIGNAL bit. I lost sight of the fact that this is done only when the current task is not the SQ thread, and this sent my investigation in the wrong direction at first. Your intuition is correct. I did indeed stumble into a race condition that ends with the SQ thread not being woken up. I have absolutely no idea why I only very recently stumbled into the perfect timing that makes the problem happen hundreds of times every day... I have tested the fix. The problem has not manifested itself for the last 4 hours, so I think that after a few unsuccessful attempts, I have finally found a winner. Expect a patch very soon!
I have 2 userspace threads, each using their own private io_uring instance. Those 2 io_uring instances share a single SQPOLL thread.

I added an optimization to my io_uring_enter() calls that wait for completions: I set min_complete to 1 + the number of IORING_OP_PROVIDE_BUFFERS sqes submitted, based on the observation that IORING_OP_PROVIDE_BUFFERS execution is certain and immediate, so there is no point in having the wait return just because of their completions.
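The wait-count computation described above can be sketched as follows; `min_complete_for_batch()` is a hypothetical helper (not part of liburing), and its argument is the number of IORING_OP_PROVIDE_BUFFERS sqes queued in the current submission batch:

```c
#include <assert.h>

/* An IORING_OP_PROVIDE_BUFFERS sqe completes immediately and
 * unconditionally, so its CQE alone should not end the wait. Ask
 * io_uring_enter() for 1 "real" completion plus one per
 * PROVIDE_BUFFERS sqe queued in this batch.
 * Hypothetical helper, not a liburing API. */
static unsigned min_complete_for_batch(unsigned nr_provide_buffers)
{
    return 1 + nr_provide_buffers;
}
```

With the fd counts from this issue, the value passed as min_complete would be at most 54 for the 53-fd ring and at most 4 for the 3-fd ring.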
Since this change, I have started to experience the following. The change itself seems to work nicely: my userspace threads' CPU usage has significantly decreased as a result.

However, after some time (roughly once every 30 minutes), everything stops. SQPOLL goes to sleep, and my 2 userspace threads are waiting for completions inside io_uring_enter().

What unblocks this is the specified 30-second timeout. Once the wait times out, my threads see that no new data came in during the last 30 seconds, so they declare the connections dead and initiate reconnections ( #363 ). Once all the threads wake up, the pending read completions come in and show that the connections weren't really dead... I just ended up in some weird deadlock situation.
I don't know exactly what is happening. I'm just going to use this issue to document my debugging.
sq_thread_idle is set to 5.
(An obvious potential experiment is to increase that value)
QD is 256
1 thread has 53 fds, so its max theoretical min_complete value would be 54
the other has 3 fds, so its max theoretical min_complete value would be 4
but even if it goes to sleep, a new completion should wake up SQPOLL, no?
The only other scenario that I can think of is the CQ ring ending up full. That would explain the deadlock, but with my number of fds, where each can only have 1 pending read + 1 PROVIDE_BUFFERS sqe, I have a hard time seeing how this could happen...
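That intuition can be checked with simple arithmetic, using the per-fd bound stated above (a later comment in this thread notes the bound can be exceeded when buffers are batch-released, so treat this as an estimate) and assuming the CQ ring holds at least QD entries (by default io_uring sizes the CQ ring at twice the SQ entries):

```c
#include <assert.h>

/* Rough upper bound on in-flight CQEs for one ring: each fd has at
 * most 1 pending read plus 1 PROVIDE_BUFFERS sqe, i.e. at most
 * 2 CQEs per fd. An estimate only; batch-released buffers can queue
 * more PROVIDE_BUFFERS sqes per fd. */
static unsigned max_inflight_cqes(unsigned nr_fds)
{
    return nr_fds * 2;
}
```

For the 53-fd ring that gives 106 CQEs, comfortably below a 256-entry queue depth, which is why a full CQ looks unlikely here under these assumptions.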
So this is the current situation, I'll update this as I discover new stuff during my investigation...