Soft deadlock with IORING_SETUP_IOPOLL and IORING_SETUP_ATTACH_WQ #367
Actually, 1 PROVIDE_BUFFERS sqe per fd is inaccurate. It could be more. Read buffers are stored in a custom OpenSSL BIO object and they can be batch-released. The power of documenting a problem is that it puts your invalid assumptions right in your face... I'll keep you updated.
I have some new info... The service that I am connecting to has 2 platforms: beta and prod. I always connect to beta because I found out that it provides a multi-msec latency edge over prod... The downside is that the service is sometimes buggy, as they test the next release on it. I just flipped the service I connect to from beta to prod and I haven't experienced the deadlock issue for the last 2-3 hours. This doesn't automatically mean that my issue comes from the service and that the client side is flawless. Until proven otherwise, I think that the client side has an issue that gets triggered by a different sequence of events found on beta. The 3-fd io_uring instance is always connected to prod anyway. If a full CQ leads to a deadlock, that would mean 1 full CQ could stall all the other attached io_uring instances... I am going to validate this by looking at the code...
I think that I have found the problem and I am about to have a solution for it, but in the meantime, here are a few observations that I have made during my investigation. In io_sq_thread():

The first thing that get_signal() does is call task_work_run(), so if all the validations that io_run_task_work() does are important, they get bypassed. Also, I was wondering in which context io_run_task_work() could be called where the current task state is not TASK_RUNNING... If this code is considered to be in the hot path, there might be some optimizations possible by avoiding calling task_work_run() twice. I was surprised to discover that pending task_works are not considered an activity that pushes back the idle timeout. Has this been considered? At least a simple workaround for a user wanting more spinning is to increase the sq_thread_idle value.
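As a concrete illustration of that workaround, here is a minimal sketch of requesting a longer SQPOLL idle at ring-setup time. The struct below is a trimmed local mirror of `struct io_uring_params` (so the snippet stands alone), and the flag value is reproduced from `<linux/io_uring.h>`; in real code you would fill the genuine struct and pass it to `io_uring_setup()` or liburing's `io_uring_queue_init_params()`:

```c
#include <assert.h>

/* Value reproduced from <linux/io_uring.h> for illustration. */
#define IORING_SETUP_SQPOLL (1U << 1)

/* Trimmed local mirror of struct io_uring_params (illustration only). */
struct sq_idle_params {
    unsigned flags;
    unsigned sq_thread_idle; /* ms before the SQ thread goes idle */
};

/* Build setup parameters that keep the SQ thread spinning longer
 * between submissions, e.g. idle_ms = 50 instead of the 5 used in
 * this issue. */
static struct sq_idle_params make_sqpoll_params(unsigned idle_ms)
{
    struct sq_idle_params p = {0};

    p.flags = IORING_SETUP_SQPOLL;
    p.sq_thread_idle = idle_ms;
    return p;
}
```

The trade-off is straightforward: a larger `sq_thread_idle` burns more CPU in the SQ thread but widens the window during which submissions are picked up without a wakeup.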
completions should wake the SQPOLL task, see
That chunk with task_works and SQPOLL idle-deadline pushing -- may be good.
I don't remember all that off the bat, but one example is io_cqring_wait_schedule(), see where it's called and how TASK_INTERRUPTIBLE is set just before it.
They're not important in your case. PF_EXITING can only be set when the task is dying, and so not running.
To test out your hypothesis, you can print sq_ring->flags from userspace to see if IORING_SQ_CQ_OVERFLOW is set:

```c
struct io_uring *ring = ...;
unsigned flags = IO_URING_READ_ONCE(*ring->sq.kflags);
```
IORING_SETUP_ATTACH_WQ? If so, then that's not how it's supposed to work. All overflowed CQEs will be stored in an internal list, and userspace will see IORING_SQ_CQ_OVERFLOW set. See
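A userspace check for that overflow condition might look like the sketch below. The flag value is reproduced from `<linux/io_uring.h>` so the snippet stands alone, and `cq_has_overflowed()` is a hypothetical helper, not a liburing API; with liburing you would feed it the flags read via `IO_URING_READ_ONCE(*ring->sq.kflags)`:

```c
#include <assert.h>
#include <stdbool.h>

/* Value reproduced from <linux/io_uring.h> for illustration. */
#define IORING_SQ_CQ_OVERFLOW (1U << 1)

/* True if the kernel has stashed overflowed CQEs on its internal
 * list. A later io_uring_enter() call with IORING_ENTER_GETEVENTS
 * gives the kernel a chance to flush them back into the CQ ring as
 * space frees up. Hypothetical helper, not part of liburing. */
static bool cq_has_overflowed(unsigned sq_flags)
{
    return (sq_flags & IORING_SQ_CQ_OVERFLOW) != 0;
}
```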
Pavel, you are correct: signal_pending() is quite rare. I was still in the mindset that adding a task_work was setting the infamous TIF_NOTIFY_SIGNAL bit. I lost sight of the fact that this is done only when the current task is not the SQ thread, and this sent my investigation in the wrong direction at first. Your intuition is correct. I did indeed stumble into a race condition that ends with the SQ thread not being woken up. I have absolutely no idea why I only very recently stumbled into the perfect timing that makes the problem happen hundreds of times every day... I have tested the fix. The problem has not manifested itself for the last 4 hours, so I think that after a few unsuccessful attempts, I have finally found a winner. Expect a patch very soon!
I have 2 userspace threads, each using their own private io_uring instance. Those 2 io_uring instances share a single SQPOLL thread.

I added an optimization to my io_uring_enter() calls that wait for completions: I set min_complete to 1 + the number of IORING_OP_PROVIDE_BUFFERS sqes submitted, based on the observation that IORING_OP_PROVIDE_BUFFERS execution is certain and immediate, so there is no point in having the wait return just because of their completions.
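The wait-count computation described above can be sketched as follows; `min_complete_for_batch()` is a hypothetical helper (not part of liburing), and its argument is the number of IORING_OP_PROVIDE_BUFFERS sqes queued in the current submission batch:

```c
#include <assert.h>

/* An IORING_OP_PROVIDE_BUFFERS sqe completes immediately and
 * unconditionally, so its CQE alone should not end the wait. Ask
 * io_uring_enter() for 1 "real" completion plus one per
 * PROVIDE_BUFFERS sqe queued in this batch.
 * Hypothetical helper, not a liburing API. */
static unsigned min_complete_for_batch(unsigned nr_provide_buffers)
{
    return 1 + nr_provide_buffers;
}
```

With the fd counts from this issue, the value passed as min_complete would be at most 54 for the 53-fd ring and at most 4 for the 3-fd ring.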
Since this change, I have started to experience the following. The change itself seems to work nicely: my userspace threads' CPU usage has significantly decreased as a result.

However, after some time (roughly once every 30 minutes), everything stops. SQPOLL goes to sleep, and my 2 userspace threads are waiting for completions inside io_uring_enter().

What unblocks this is the specified 30-second timeout. Once the wait times out, my threads see that no new data came in during the last 30 seconds, so they declare the connections dead and initiate reconnections ( #363 ). Once all the threads wake up, the pending read completions come in and show that the connections weren't really dead... I just ended up in some weird deadlock situation.
I don't know exactly what is happening. I'm just going to use this issue to document my debugging.
sq_thread_idle is set to 5.
(An obvious potential experiment is to increase that value)
QD is 256
1 thread has 53 fds, so its max theoretical min_complete value would be 54
the other has 3 fds, so its max theoretical min_complete value would be 4
but even if it goes to sleep, a new completion should wake up SQPOLL, no?
The only other scenario that I can think of is the CQ ring ending up full. That would explain the deadlock, but with my number of fds, where each can only have 1 pending read + 1 PROVIDE_BUFFERS sqe, I have a hard time seeing how this could happen...
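That intuition can be checked with simple arithmetic, using the per-fd bound stated above (a later comment in this thread notes the bound can be exceeded when buffers are batch-released, so treat this as an estimate) and assuming the CQ ring holds at least QD entries (by default io_uring sizes the CQ ring at twice the SQ entries):

```c
#include <assert.h>

/* Rough upper bound on in-flight CQEs for one ring: each fd has at
 * most 1 pending read plus 1 PROVIDE_BUFFERS sqe, i.e. at most
 * 2 CQEs per fd. An estimate only; batch-released buffers can queue
 * more PROVIDE_BUFFERS sqes per fd. */
static unsigned max_inflight_cqes(unsigned nr_fds)
{
    return nr_fds * 2;
}
```

For the 53-fd ring that gives 106 CQEs, comfortably below a 256-entry queue depth, which is why a full CQ looks unlikely here under these assumptions.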
So this is the current situation, I'll update this as I discover new stuff during my investigation...