performance degradation starting from 5.7.16 #215

romange · 2020-09-27T22:43:45Z

Hi Jens,

I noticed that there is serious performance regression using sockets with io_uring starting from v5.7.16.
Unfortunately, it continues into 5.8 and 5.9 I have not checked yet for which patch versions exactly.

perf top shows exactly where the contention happens but I can not identify the root-cause.
I am not sure if it's directly related to io_uring or some other change that caused the regression.

5.7.15 and earlier does not inhibit such contention for the same machine (18 physical cores). I am testing on Ubuntu 20.04 patched with prebuilt kernels from https://kernel.ubuntu.com/~kernel-ppa/mainline/

For 5.7.15 my benchmark achieves 1.6M qps and system cpu is at ~80%.
for 5.7.16 or later it achieves only 1M qps and the system cpu is is at ~100% .

   57.66%  [kernel]       [k] native_queued_spin_lock_slowpath
     1.09%  [kernel]       [k] _raw_spin_lock_irqsave
     1.05%  [kernel]       [k] _raw_spin_lock
     0.83%  [kernel]       [k] tcp_recvmsg
     0.74%  [kernel]       [k] skb_release_data
     0.69%  [kernel]       [k] ena_start_xmit
     0.69%  [kernel]       [k] memset_erms
     0.69%  [kernel]       [k] __slab_free
     0.63%  [kernel]       [k] mwait_idle_with_hints.constprop.0
     0.56%  [kernel]       [k] copy_user_generic_unrolle
     0.49%  [kernel]       [k] ena_clean_tx_irq
     0.45%  [kernel]       [k] read_tsc

The text was updated successfully, but these errors were encountered:

axboe · 2020-09-27T22:51:09Z

It's probably this one:

commit 175852503b41b3b74ffd99789240fa21ea1b8584
Author: Jens Axboe axboe@kernel.dk
Date: Thu Aug 6 19:41:50 2020 -0600

io_uring: use TWA_SIGNAL for task_work uncondtionally

which I did suspect might cause some perf issues... What's your benchmark? Would love to be able to run it and compare. There are ways we can improve this.

romange · 2020-09-27T23:37:55Z

The benchmarks programs are attached here:
https://drive.google.com/file/d/1a8f8Za_f9Z6ejC3nlQf_0NRxV61m5cc5/view?usp=sharing

it's just a ping/pong setup simulating high throughput scenario.

You will need to install libunwind lib for the server.
There are some other shared libs that are bundled with ping_iouring_server.

On your server machine run:
LD_LIBRARY_PATH=... /ping_iouring_server --logtostderr
it should print AcceptServer - listening on port 6380

On the client machine run:
./redis-benchmark -p 6380 -h <server_ip> -n 10000000 -c 400 -t ping_inline --threads 24

you can play with -c (number of connections) and -threads on the client side to decrease/increase the load.

romange · 2020-09-28T11:08:22Z

I can confirm that 175852503b41b3b74ffd99789240fa21ea1b8584 causes the regression.

axboe · 2020-09-28T15:35:52Z

I ran your test case this morning, and I agree it does cause a slowdown. It's somewhat annoying, as 99.9% of use cases don't need this, and the reasons are pretty complicated. Basically you only need this for some eventfd related setups, where the same application that does the IO also ends up blocking in the kernel waiting on the eventfd. The eventfd will trigger when completions are run, but some completions are not run until the application exits that loop and processes completions.

One idea would be to pass in a flag for io_uring setup that says "I do weird stuff with eventfd, use the safe/slower path" and default to NOT doing this. Don't necessarily like that approach, but right now I can't think of a better way to handle this.

romange · 2020-09-28T15:55:55Z

Can you just document the exact flow as invalid/unsupported? It sounds like a deadlock what you are describing. Btw, I use eventfd polling to wake io_uring loop from other threads and it works very well so i guess it applies to very specific use-case of eventfd. Or maybe you mean - eventfd generated by io_uring completions?

…

On Mon, Sep 28, 2020, 18:36 Jens Axboe ***@***.***> wrote: I ran your test case this morning, and I agree it does cause a slowdown. It's somewhat annoying, as 99.9% of use cases don't need this, and the reasons are pretty complicated. Basically you only need this for some eventfd related setups, where the same application that does the IO also ends up blocking in the kernel waiting on the eventfd. The eventfd will trigger when completions are run, but some completions are not run until the application exits that loop and processes completions. One idea would be to pass in a flag for io_uring setup that says "I do weird stuff with eventfd, use the safe/slower path" and default to NOT doing this. Don't necessarily like that approach, but right now I can't think of a better way to handle this. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#215 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AA4BFCCHMARHEMGCKUTGEBLSICUORANCNFSM4R32LOTQ> .

axboe · 2020-09-28T15:56:32Z

Note that this is a problem mainly for threaded cases, as threads share the signal group. To process the completion work, we need to lock the signal structure, which obviously turns to a mess if you have a lot of threads in the same group.

romange · 2020-09-28T15:59:00Z

What's this signal group? Can I allocate a signal group per thread?

…

On Mon, Sep 28, 2020, 18:56 Jens Axboe ***@***.***> wrote: Note that this is a problem mainly for threaded cases, as threads share the signal group. To process the completion work, we need to lock the signal structure, which obviously turns to a mess if you have a lot of threads in the same group. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#215 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AA4BFCGOZFC7RHMEKTN7NNLSICW4BANCNFSM4R32LOTQ> .

isilence · 2020-09-28T16:09:24Z

One idea would be to pass in a flag for io_uring setup that says "I do weird stuff with eventfd, use the safe/slower path" and default to NOT doing this. Don't necessarily like that approach, but right now I can't think of a better way to handle this.

Can we detect it and set the flag in-kernel?

axboe · 2020-09-28T16:12:36Z

What's this signal group? Can I allocate a signal group per thread?

You can't currently break that sharing, clone(2) could potentially work, but doesn't support that.

axboe · 2020-09-28T16:16:50Z

Can you just document the exact flow as invalid/unsupported? It sounds like a deadlock what you are describing.

Here's an example:

#include <sys/eventfd.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>
#include <liburing.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/time.h>

struct thread_data {
	struct io_uring *ring;
	int read_fd, write_fd;
};

static void error_exit(char *message)
{
	perror(message);
	exit(1);
}

static void *listener_thread(void *data)
{
	struct thread_data *td = data;
	struct io_uring_cqe *cqe;
	int ret;

        ret = io_uring_wait_cqe(td->ring, &cqe);
        if (ret < 0) {
        	fprintf(stderr, "Error waiting for completion: %s\n",
                	strerror(-ret));
		goto err;
        }
	if (cqe->res < 0) {
		fprintf(stderr, "Error in async operation: %s\n", strerror(-cqe->res));
		goto err;
        }
	io_uring_cqe_seen(td->ring, cqe);
	return NULL;
err:
	return (void *) 1;
}

static void *wakeup_io_uring(void *data)
{
	struct thread_data *td = data;
	char buf = 0x40;
	int res;

	res = write(td->write_fd, &buf, sizeof(buf));
	if (res < 0) {
		perror("write");
		return (void *) 1;
	}
	return NULL;
}

int main(int argc, char *argv[])
{
	struct io_uring_sqe *sqe;
	struct thread_data td;
	struct io_uring ring;
	pthread_t t1, t2;
	int ret, fds[2];
	void *pret;

	if (pipe(fds) < 0)
		error_exit("eventfd");

	ret = io_uring_queue_init(8, &ring, 0);
	if (ret) {
		fprintf(stderr, "Unable to setup io_uring: %s\n", strerror(-ret));
		return 1;
	}

	td.read_fd = fds[0];
	td.write_fd = fds[1];
	td.ring = &ring;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_poll_add(sqe, fds[0], POLLIN);
	sqe->user_data = 2;
	ret = io_uring_submit(&ring);
	if (ret != 1) {
		fprintf(stderr, "ring_submit=%d\n", ret);
		return 1;
	}

	pthread_create(&t1, NULL, listener_thread, &td);

	sleep(1);

	pthread_create(&t2, NULL, wakeup_io_uring, &td);
	pthread_join(t1, &pret);

	io_uring_queue_exit(&ring);
	return pret != NULL;
}

Basically we have a main task, M, that issues a poll request. It then spawns two threads, one (T1) which waits for this poll request to complete, and one (T2) which issues a write that triggers this condition. This write spawns a task_work item for M, this work needs to be processed for the completion CQE to be filled in. Task M then waits for T1 to exit. M ends up waiting in futex, and hence doesn't process the task_work required to have T1 receive the completion event.

Hence we won't make any progress here unless we can interrupt the futex wait and process the completion.

axboe · 2020-09-28T16:18:54Z

Can we detect it and set the flag in-kernel?

I'd love for that to be the case, do let me know if you have any good ideas here. Continuing on the above example, we can't detect that task M is in the kernel and waiting. And even if we could, we could have a race if it's just on the way to do just that.

Would love to improve this case, it's very annoying to have to slow down the general case for some odd corner cases. It would be very important to fix.

axboe · 2020-09-28T16:34:19Z

For this exact case, we could make it work by just doing:

diff --git a/kernel/futex.c b/kernel/futex.c
index a5876694a60e..f2c244a3954f 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -2579,6 +2579,9 @@ static void futex_wait_queue_me(struct futex_hash_bucket *hb, struct futex_q *q,
 	 * has tried to wake us, and we can skip the call to schedule().
 	 */
 	if (likely(!plist_node_empty(&q->list))) {
+		if (current->task_works)
+			task_work_run();
+
 		/*
 		 * If the timer has already expired, current will already be
 		 * flagged for rescheduling. Only call schedule if there

but the problem is that this fixes just the one case that could trigger it. Maybe M would be waiting somewhere else. The canonical case is:

while (!condition_to_wait_on)
        schedule();

in essence, where we then never run the task work as we're not exiting the kernel and we need to process task_work for condition_to_wait_on to become true. TWA_SIGNAL forces this, but the side effects are a bit grim on (particularly) threaded cases.

romange · 2020-09-28T16:47:48Z

I understand the flow but I am probably missing something about the mechanics of io_uring behind the scene.

Why do we need M to process something? I thought io_uring completions are processed in magical kernel threads :) this is why we have ring-buffer - to synchronize completions why user thread does some other work...
Even if we need M to do something instead of being stuck on join, what will it be? why calling io_uring_wait_cqe in the listener (t1) thread is not enough?

As a general note, I find io_uring data structure not very thread-safe from userland perspective: you can not issue io_ring_get_sqe from multiple threads anyway because you may not know who might trigger io_uring_submit in parallel. Therefore, I consider io_uring single threaded creature. For example, for inter-thread communication case even though IORING_OP_NOP could be more efficient I did not find a way to issue it safely, so ended up using eventfd for the sole purpose of waking up other io_uring loops.

axboe · 2020-09-28T20:41:41Z

There are only magical threads if we need them, in general any sort of deferred try/retry is done by something called task_work (let's shorten that to tw). tw is run by the task itself, which makes it a lot less expensive than punting to a different thread. The work is run on transition between kernel/user.

So for your question 1+2, M has to process the poll request completion in the example above. Once T2 triggers the poll request, tw gets added to M. Once M runs this tw item, then the completion shows up in the ring and T1 gets happy and receives the completion event. That's why we need M to do something. For this poll example, we could potentially have T1 process the tw, but that's not necessarily the case for other work items, say if the work item needs to access the mm of M. This would work for T1 since it's a thread, but you could also have the above be M, P1, and P2 where P are forked processes.

And yes, the rings are not thread safe, generally a thread should have its own ring(s). You can, however, get by with isolating the completion and submission side, those are independent in the raw API, not in liburing though.

romange · 2020-09-29T05:10:44Z

thank you for the explanation, I think it could be useful to add it to io_uring.pdf or maybe to man pages.
In particular, I find this sentence in man pages a bit confusing.

Note that, for interrupt-driven I/O (where IORING_SETUP_IOPOLL was not specified in the call to io_uring_setup(2)), an application may check the completion queue for event completions  without entering the kernel at all.

I understand, based on your explanation, that in a regular case (without IORING_SETUP_I OPOLL or IORING_SETUP_SQPOLL), I need to delegate control to the kernel so it could process TWs and fill the CQ ring.

Do I have to call io_uring_enter or any kernel function will do the job?
Can I know efficiently from the userland that there are TWs that needs to be processed? If not, is it possible to add a flag to cq.kflags that tells TWs are available?

The reason I am asking is that I designed my io_uring loop under bit different assumptions - see below.
It still works but I thought as long as TWs become available we won't reach io_uring_wait_cqe step and io_uring_peek_batch_cqe is gonna harvest CQEs without any system calls. Now I understand that calling io_uring_wait_cqe is needed to process TWs since io_ring_submit does not call io_uring_enter if no SQEs are added. So, I am looking for the cheapest way to know I need to yield to the kernel, i.e. that there are TWs are pending and I should call io_uring_enter(ring->ring_fd, 0, 0, IORING_ENTER_GETEVENTS, NULL).

while(true) {
     num_requests = io_uring_submit(&ring);
     num_msgs = handle_message_passing();

     io_uring_peek_batch_cqe(&ring, &completions);
     if (!completions.empty()) {
       dispatch_completions();
     }
     if (num_requests || num_msgs || !completions.empty()) {
       continue;
     }

      io_uring_wait_cqe(&ring);   // epoll_wait(timeout=-1)
  };

axboe · 2020-09-30T20:51:10Z

Something like this might be viable. If you have a chance to be able to test this, then that'd be great...

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7101ac64bb20..79057cecf121 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -18,6 +18,7 @@ config X86_32
 	select MODULES_USE_ELF_REL
 	select OLD_SIGACTION
 	select GENERIC_VDSO_32
+	select ARCH_HAS_TIF_TW
 
 config X86_64
 	def_bool y
@@ -30,6 +31,7 @@ config X86_64
 	select MODULES_USE_ELF_RELA
 	select NEED_DMA_MAP_STATE
 	select SWIOTLB
+	select ARCH_HAS_TIF_TW
 
 config FORCE_DYNAMIC_FTRACE
 	def_bool y
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 267701ae3d86..79fe7db3208c 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -93,6 +93,7 @@ struct thread_info {
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* IA32 compatibility process */
 #define TIF_SLD			18	/* Restore split lock detection on context switch */
+#define TIF_TASKWORK		19	/* task_work pending */
 #define TIF_MEMDIE		20	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
 #define TIF_IO_BITMAP		22	/* uses I/O bitmap */
@@ -123,6 +124,7 @@ struct thread_info {
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
 #define _TIF_SLD		(1 << TIF_SLD)
+#define _TIF_TASKWORK		(1 << TIF_TASKWORK)
 #define _TIF_POLLING_NRFLAG	(1 << TIF_POLLING_NRFLAG)
 #define _TIF_IO_BITMAP		(1 << TIF_IO_BITMAP)
 #define _TIF_FORCED_TF		(1 << TIF_FORCED_TF)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 159c7476b11b..a8e3dbdfad63 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -69,7 +69,7 @@
 
 #define EXIT_TO_USER_MODE_WORK						\
 	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
-	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING |			\
+	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_TASKWORK|	\
 	 ARCH_EXIT_TO_USER_MODE_WORK)
 
 /**
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 1bad18a1d8ba..470a1da5166b 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -355,7 +355,8 @@ static inline int restart_syscall(void)
 
 static inline int signal_pending(struct task_struct *p)
 {
-	return unlikely(test_tsk_thread_flag(p,TIF_SIGPENDING));
+	return unlikely(test_tsk_thread_flag(p, TIF_SIGPENDING) ||
+			test_tsk_thread_flag(p, TIF_TASKWORK));
 }
 
 static inline int __fatal_signal_pending(struct task_struct *p)
@@ -502,7 +503,8 @@ extern int set_user_sigmask(const sigset_t __user *umask, size_t sigsetsize);
 static inline void restore_saved_sigmask_unless(bool interrupted)
 {
 	if (interrupted)
-		WARN_ON(!test_thread_flag(TIF_SIGPENDING));
+		WARN_ON(!test_thread_flag(TIF_SIGPENDING) &&
+			!test_thread_flag(TIF_TASKWORK));
 	else
 		restore_saved_sigmask();
 }
diff --git a/include/linux/signal.h b/include/linux/signal.h
index 7bbc0e9cf084..4bf470d66791 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -6,6 +6,11 @@
 #include <linux/signal_types.h>
 #include <linux/string.h>
 
+#ifndef CONFIG_ARCH_HAS_TIF_TW
+#define TIF_TASK_WORK	0
+#define _TIF_TASK_WORK	0
+#endif
+
 struct task_struct;
 
 /* for sysctl */
diff --git a/init/Kconfig b/init/Kconfig
index d6a0b31b13dc..3f9c9dd69f84 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -2333,6 +2333,9 @@ config ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
 config ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
 	bool
 
+config ARCH_HAS_TIF_TW
+	bool
+
 # It may be useful for an architecture to override the definitions of the
 # SYSCALL_DEFINE() and __SYSCALL_DEFINEx() macros in <linux/syscalls.h>
 # and the COMPAT_ variants in <linux/compat.h>, in particular to use a
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 6fdb6105e6d6..755e0aac7e2c 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -160,6 +160,9 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		if (ti_work & _TIF_SIGPENDING)
 			arch_do_signal(regs);
 
+		if (ti_work & _TIF_TASKWORK)
+			task_work_run();
+
 		if (ti_work & _TIF_NOTIFY_RESUME) {
 			clear_thread_flag(TIF_NOTIFY_RESUME);
 			tracehook_notify_resume(regs);
diff --git a/kernel/signal.c b/kernel/signal.c
index a38b3edc6851..3ed090798811 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2529,6 +2529,9 @@ bool get_signal(struct ksignal *ksig)
 	struct signal_struct *signal = current->signal;
 	int signr;
 
+	if (unlikely(current->task_works))
+		task_work_run();
+
 	if (unlikely(uprobe_deny_signal()))
 		return false;
 
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 613b2d634af8..de85a4c3f899 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -28,7 +28,6 @@ int
 task_work_add(struct task_struct *task, struct callback_head *work, int notify)
 {
 	struct callback_head *head;
-	unsigned long flags;
 
 	do {
 		head = READ_ONCE(task->task_works);
@@ -41,7 +40,10 @@ task_work_add(struct task_struct *task, struct callback_head *work, int notify)
 	case TWA_RESUME:
 		set_notify_resume(task);
 		break;
-	case TWA_SIGNAL:
+	case TWA_SIGNAL: {
+#ifndef CONFIG_ARCH_HAS_TIF_TW
+		unsigned long flags;
+
 		/*
 		 * Only grab the sighand lock if we don't already have some
 		 * task_work pending. This pairs with the smp_store_mb()
@@ -53,7 +55,11 @@ task_work_add(struct task_struct *task, struct callback_head *work, int notify)
 			signal_wake_up(task, 0);
 			unlock_task_sighand(task, &flags);
 		}
+#else
+		set_tsk_thread_flag(task, TIF_TASKWORK);
+#endif
 		break;
+		}
 	}
 
 	return 0;
@@ -110,6 +116,11 @@ void task_work_run(void)
 	struct task_struct *task = current;
 	struct callback_head *work, *head, *next;
 
+#ifdef CONFIG_ARCH_HAS_TIF_TW
+	if (test_tsk_thread_flag(task, TIF_TASKWORK))
+		clear_tsk_thread_flag(task, TIF_TASKWORK);
+#endif
+
 	for (;;) {
 		/*
 		 * work->func() can do task_work_add(), do not set

romange · 2020-09-30T21:05:59Z

Sure, will test it during the weekend and let you know. Btw, why is it only for x86? Will you apply similar changes to aarch64?

…

On Wed, Sep 30, 2020, 23:51 Jens Axboe ***@***.***> wrote: Something like this might be viable. If you have a chance to be able to test this, then that'd be great... diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 7101ac64bb20..79057cecf121 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -18,6 +18,7 @@ config X86_32 select MODULES_USE_ELF_REL select OLD_SIGACTION select GENERIC_VDSO_32 + select ARCH_HAS_TIF_TW config X86_64 def_bool y @@ -30,6 +31,7 @@ config X86_64 select MODULES_USE_ELF_RELA select NEED_DMA_MAP_STATE select SWIOTLB + select ARCH_HAS_TIF_TW config FORCE_DYNAMIC_FTRACE def_bool y diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h index 267701ae3d86..79fe7db3208c 100644 --- a/arch/x86/include/asm/thread_info.h +++ b/arch/x86/include/asm/thread_info.h @@ -93,6 +93,7 @@ struct thread_info { #define TIF_NOTSC 16 /* TSC is not accessible in userland */ #define TIF_IA32 17 /* IA32 compatibility process */ #define TIF_SLD 18 /* Restore split lock detection on context switch */ +#define TIF_TASKWORK 19 /* task_work pending */ #define TIF_MEMDIE 20 /* is terminating due to OOM killer */ #define TIF_POLLING_NRFLAG 21 /* idle is polling for TIF_NEED_RESCHED */ #define TIF_IO_BITMAP 22 /* uses I/O bitmap */ @@ -123,6 +124,7 @@ struct thread_info { #define _TIF_NOTSC (1 << TIF_NOTSC) #define _TIF_IA32 (1 << TIF_IA32) #define _TIF_SLD (1 << TIF_SLD) +#define _TIF_TASKWORK (1 << TIF_TASKWORK) #define _TIF_POLLING_NRFLAG (1 << TIF_POLLING_NRFLAG) #define _TIF_IO_BITMAP (1 << TIF_IO_BITMAP) #define _TIF_FORCED_TF (1 << TIF_FORCED_TF) diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h index 159c7476b11b..a8e3dbdfad63 100644 --- a/include/linux/entry-common.h +++ b/include/linux/entry-common.h @@ -69,7 +69,7 @@ #define EXIT_TO_USER_MODE_WORK \ (_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \ - _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | \ + _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_TASKWORK| \ ARCH_EXIT_TO_USER_MODE_WORK) /** diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h index 1bad18a1d8ba..470a1da5166b 100644 --- a/include/linux/sched/signal.h +++ b/include/linux/sched/signal.h @@ -355,7 +355,8 @@ static inline int restart_syscall(void) static inline int signal_pending(struct task_struct *p) { - return unlikely(test_tsk_thread_flag(p,TIF_SIGPENDING)); + return unlikely(test_tsk_thread_flag(p, TIF_SIGPENDING) || + test_tsk_thread_flag(p, TIF_TASKWORK)); } static inline int __fatal_signal_pending(struct task_struct *p) @@ -502,7 +503,8 @@ extern int set_user_sigmask(const sigset_t __user *umask, size_t sigsetsize); static inline void restore_saved_sigmask_unless(bool interrupted) { if (interrupted) - WARN_ON(!test_thread_flag(TIF_SIGPENDING)); + WARN_ON(!test_thread_flag(TIF_SIGPENDING) && + !test_thread_flag(TIF_TASKWORK)); else restore_saved_sigmask(); } diff --git a/include/linux/signal.h b/include/linux/signal.h index 7bbc0e9cf084..4bf470d66791 100644 --- a/include/linux/signal.h +++ b/include/linux/signal.h @@ -6,6 +6,11 @@ #include <linux/signal_types.h> #include <linux/string.h> +#ifndef CONFIG_ARCH_HAS_TIF_TW +#define TIF_TASK_WORK 0 +#define _TIF_TASK_WORK 0 +#endif + struct task_struct; /* for sysctl */ diff --git a/init/Kconfig b/init/Kconfig index d6a0b31b13dc..3f9c9dd69f84 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -2333,6 +2333,9 @@ config ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE config ARCH_HAS_SYNC_CORE_BEFORE_USERMODE bool +config ARCH_HAS_TIF_TW + bool + # It may be useful for an architecture to override the definitions of the # SYSCALL_DEFINE() and __SYSCALL_DEFINEx() macros in <linux/syscalls.h> # and the COMPAT_ variants in <linux/compat.h>, in particular to use a diff --git a/kernel/entry/common.c b/kernel/entry/common.c index 6fdb6105e6d6..755e0aac7e2c 100644 --- a/kernel/entry/common.c +++ b/kernel/entry/common.c @@ -160,6 +160,9 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs, if (ti_work & _TIF_SIGPENDING) arch_do_signal(regs); + if (ti_work & _TIF_TASKWORK) + task_work_run(); + if (ti_work & _TIF_NOTIFY_RESUME) { clear_thread_flag(TIF_NOTIFY_RESUME); tracehook_notify_resume(regs); diff --git a/kernel/signal.c b/kernel/signal.c index a38b3edc6851..3ed090798811 100644 --- a/kernel/signal.c +++ b/kernel/signal.c @@ -2529,6 +2529,9 @@ bool get_signal(struct ksignal *ksig) struct signal_struct *signal = current->signal; int signr; + if (unlikely(current->task_works)) + task_work_run(); + if (unlikely(uprobe_deny_signal())) return false; diff --git a/kernel/task_work.c b/kernel/task_work.c index 613b2d634af8..de85a4c3f899 100644 --- a/kernel/task_work.c +++ b/kernel/task_work.c @@ -28,7 +28,6 @@ int task_work_add(struct task_struct *task, struct callback_head *work, int notify) { struct callback_head *head; - unsigned long flags; do { head = READ_ONCE(task->task_works); @@ -41,7 +40,10 @@ task_work_add(struct task_struct *task, struct callback_head *work, int notify) case TWA_RESUME: set_notify_resume(task); break; - case TWA_SIGNAL: + case TWA_SIGNAL: { +#ifndef CONFIG_ARCH_HAS_TIF_TW + unsigned long flags; + /* * Only grab the sighand lock if we don't already have some * task_work pending. This pairs with the smp_store_mb() @@ -53,7 +55,11 @@ task_work_add(struct task_struct *task, struct callback_head *work, int notify) signal_wake_up(task, 0); unlock_task_sighand(task, &flags); } +#else + set_tsk_thread_flag(task, TIF_TASKWORK); +#endif break; + } } return 0; @@ -110,6 +116,11 @@ void task_work_run(void) struct task_struct *task = current; struct callback_head *work, *head, *next; +#ifdef CONFIG_ARCH_HAS_TIF_TW + if (test_tsk_thread_flag(task, TIF_TASKWORK)) + clear_tsk_thread_flag(task, TIF_TASKWORK); +#endif + for (;;) { /* * work->func() can do task_work_add(), do not set — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#215 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AA4BFCCCQUZVUB2DE4MB3QDSIOK43ANCNFSM4R32LOTQ> .

axboe · 2020-09-30T21:11:46Z

Thanks! I'm still fiddling around with this, it's just x86 only right now since a) it's arch dependent, and b) that's the only arch I can test on. There's nothing preventing similar changes on other archs.

axboe · 2020-09-30T21:41:47Z

Here's a cleaned up version:

commit df2bf7418769dfe879087bf3fbe85b2f0736d2f8
Author: Jens Axboe <axboe@kernel.dk>
Date:   Wed Sep 30 15:36:02 2020 -0600

    kernel: decouple TASK_WORK TWA_SIGNAL handling from signals
    
    Users of TWA_SIGNAL need to have system calls interrupted and go through
    a kernel/user transition to ensure they are run. This currently works
    well from a functional standpoint, but it is heavy handed on a
    multithreaded application where sighand is shared between the threads and
    main process. Adding TWA_SIGNAL task_work on such setups need to grab
    the sighand->lock, which creates a hot spot for otherwise unrelated
    task_work.
    
    This adds TIF_TASKWORK for x86, which if set, will return true on
    checking for pending signals. That in turn causes tasks to restart the
    system call, which will run the added task_work. If TIF_TASKWORK is
    available, we'll use that for notification when TWA_SIGNAL is specified.
    If it isn't available, the existing TIF_SIGPENDING path is used.
    
    Once all archs have added support for TIF_TASKWORK, we can kill the
    old code completely. That will also allow removal of JOBCTL_TASK_WORK
    and related code.
    
    Signed-off-by: Jens Axboe <axboe@kernel.dk>

diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 267701ae3d86..79fe7db3208c 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -93,6 +93,7 @@ struct thread_info {
 #define TIF_NOTSC		16	/* TSC is not accessible in userland */
 #define TIF_IA32		17	/* IA32 compatibility process */
 #define TIF_SLD			18	/* Restore split lock detection on context switch */
+#define TIF_TASKWORK		19	/* task_work pending */
 #define TIF_MEMDIE		20	/* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG	21	/* idle is polling for TIF_NEED_RESCHED */
 #define TIF_IO_BITMAP		22	/* uses I/O bitmap */
@@ -123,6 +124,7 @@ struct thread_info {
 #define _TIF_NOTSC		(1 << TIF_NOTSC)
 #define _TIF_IA32		(1 << TIF_IA32)
 #define _TIF_SLD		(1 << TIF_SLD)
+#define _TIF_TASKWORK		(1 << TIF_TASKWORK)
 #define _TIF_POLLING_NRFLAG	(1 << TIF_POLLING_NRFLAG)
 #define _TIF_IO_BITMAP		(1 << TIF_IO_BITMAP)
 #define _TIF_FORCED_TF		(1 << TIF_FORCED_TF)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 159c7476b11b..0a3702b4774c 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -37,6 +37,10 @@
 # define _TIF_UPROBE			(0)
 #endif
 
+#ifndef _TIF_TASKWORK
+# define _TIF_TASKWORK			(0)
+#endif
+
 /*
  * TIF flags handled in syscall_enter_from_usermode()
  */
@@ -69,7 +73,7 @@
 
 #define EXIT_TO_USER_MODE_WORK						\
 	(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE |		\
-	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING |			\
+	 _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_TASKWORK|	\
 	 ARCH_EXIT_TO_USER_MODE_WORK)
 
 /**
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 1bad18a1d8ba..d1a02749df74 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -355,7 +355,17 @@ static inline int restart_syscall(void)
 
 static inline int signal_pending(struct task_struct *p)
 {
-	return unlikely(test_tsk_thread_flag(p,TIF_SIGPENDING));
+#ifdef TIF_TASKWORK
+	/*
+	 * TIF_TASKWORK isn't really a signal, but it requires the same
+	 * behavior of restarting the system call to force a kernel/user
+	 * transition.
+	 */
+	return unlikely(test_tsk_thread_flag(p, TIF_SIGPENDING) ||
+			test_tsk_thread_flag(p, TIF_TASKWORK));
+#else
+	return unlikely(test_tsk_thread_flag(p, TIF_SIGPENDING));
+#endif
 }
 
 static inline int __fatal_signal_pending(struct task_struct *p)
@@ -501,10 +511,16 @@ extern int set_user_sigmask(const sigset_t __user *umask, size_t sigsetsize);
 
 static inline void restore_saved_sigmask_unless(bool interrupted)
 {
-	if (interrupted)
+	if (interrupted) {
+#ifdef TIF_TASKWORK
+		WARN_ON(!test_thread_flag(TIF_SIGPENDING) &&
+			!test_thread_flag(TIF_TASKWORK));
+#else
 		WARN_ON(!test_thread_flag(TIF_SIGPENDING));
-	else
+#endif
+	} else {
 		restore_saved_sigmask();
+	}
 }
 
 static inline sigset_t *sigmask_to_save(void)
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 6fdb6105e6d6..755e0aac7e2c 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -160,6 +160,9 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
 		if (ti_work & _TIF_SIGPENDING)
 			arch_do_signal(regs);
 
+		if (ti_work & _TIF_TASKWORK)
+			task_work_run();
+
 		if (ti_work & _TIF_NOTIFY_RESUME) {
 			clear_thread_flag(TIF_NOTIFY_RESUME);
 			tracehook_notify_resume(regs);
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 613b2d634af8..3cd7c6ab9985 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -28,7 +28,6 @@ int
 task_work_add(struct task_struct *task, struct callback_head *work, int notify)
 {
 	struct callback_head *head;
-	unsigned long flags;
 
 	do {
 		head = READ_ONCE(task->task_works);
@@ -41,7 +40,10 @@ task_work_add(struct task_struct *task, struct callback_head *work, int notify)
 	case TWA_RESUME:
 		set_notify_resume(task);
 		break;
-	case TWA_SIGNAL:
+	case TWA_SIGNAL: {
+#ifndef TIF_TASKWORK
+		unsigned long flags;
+
 		/*
 		 * Only grab the sighand lock if we don't already have some
 		 * task_work pending. This pairs with the smp_store_mb()
@@ -53,7 +55,11 @@ task_work_add(struct task_struct *task, struct callback_head *work, int notify)
 			signal_wake_up(task, 0);
 			unlock_task_sighand(task, &flags);
 		}
+#else
+		set_tsk_thread_flag(task, TIF_TASKWORK);
+#endif
 		break;
+		}
 	}
 
 	return 0;
@@ -110,6 +116,11 @@ void task_work_run(void)
 	struct task_struct *task = current;
 	struct callback_head *work, *head, *next;
 
+#ifdef TIF_TASKWORK
+	if (test_tsk_thread_flag(task, TIF_TASKWORK))
+		clear_tsk_thread_flag(task, TIF_TASKWORK);
+#endif
+
 	for (;;) {
 		/*
 		 * work->func() can do task_work_add(), do not set

axboe · 2020-09-30T21:46:06Z

Committed it here for testing:

https://git.kernel.dk/cgit/linux-block/log/?h=tif-task_work

and I also added an arm64 patch for you. Also shows how trivial it is to add arch support for this.

romange · 2020-09-30T22:32:03Z

checking. meanwhile pls fix "defien" typo.

…

On Thu, Oct 1, 2020 at 12:46 AM Jens Axboe ***@***.***> wrote: Committed it here for testing: https://git.kernel.dk/cgit/linux-block/log/?h=tif-task_work and I also added an arm64 patch for you. Also shows how trivial it is to add arch support for this. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#215 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AA4BFCBQM2RGXMQCO5QLEOTSIORKZANCNFSM4R32LOTQ> .

-- Best regards, Roman

romange · 2020-10-01T07:34:56Z

Ok, checked the latest fix c6a44403af72ba7788b82ee83bd53ff972923117 and I confirm that the performance issue is solved!
Ping server reaches 2.7M qps run on 72 vcpus and saturate networking capacity way before overloading the cpu.

Great job!

P.S. Please backport the fix to 5.8 for the upcoming Ubuntu release.

axboe · 2020-10-01T14:31:19Z

Awesome, thanks for testing! I've cleaned it up a bit more, and posted it for comments/review:

https://lore.kernel.org/io-uring/0b5336a7-c975-a8f8-e988-e983e2340d99@kernel.dk/T/#u

I forgot to add your Reported-by, I'll get that done for the next round. I'm sure we're not done yet...

romange · 2020-10-01T14:49:17Z

np, I've recently learnt that my company will appreciate if my work email will appear in the log. Can you please use romger@amazon.com from now on?

axboe · 2020-10-01T14:50:45Z

You bet, I'll use that email. Once we have a final and you've vetted it, would be great if I could also add your Tested-by in there.

axboe · 2020-11-24T21:03:09Z

It's queued up for 5.11 and will go into 5.10-stable once that's done. Closing this one out.

…x-block Pull TIF_NOTIFY_SIGNAL updates from Jens Axboe: "This sits on top of of the core entry/exit and x86 entry branch from the tip tree, which contains the generic and x86 parts of this work. Here we convert the rest of the archs to support TIF_NOTIFY_SIGNAL. With that done, we can get rid of JOBCTL_TASK_WORK from task_work and signal.c, and also remove a deadlock work-around in io_uring around knowing that signal based task_work waking is invoked with the sighand wait queue head lock. The motivation for this work is to decouple signal notify based task_work, of which io_uring is a heavy user of, from sighand. The sighand lock becomes a huge contention point, particularly for threaded workloads where it's shared between threads. Even outside of threaded applications it's slower than it needs to be. Roman Gershman <romger@amazon.com> reported that his networked workload dropped from 1.6M QPS at 80% CPU to 1.0M QPS at 100% CPU after io_uring was changed to use TIF_NOTIFY_SIGNAL. The time was all spent hammering on the sighand lock, showing 57% of the CPU time there [1]. There are further cleanups possible on top of this. One example is TIF_PATCH_PENDING, where a patch already exists to use TIF_NOTIFY_SIGNAL instead. Hopefully this will also lead to more consolidation, but the work stands on its own as well" [1] axboe/liburing#215 * tag 'tif-task_work.arch-2020-12-14' of git://git.kernel.dk/linux-block: (28 commits) io_uring: remove 'twa_signal_ok' deadlock work-around kernel: remove checking for TIF_NOTIFY_SIGNAL signal: kill JOBCTL_TASK_WORK io_uring: JOBCTL_TASK_WORK is no longer used by task_work task_work: remove legacy TWA_SIGNAL path sparc: add support for TIF_NOTIFY_SIGNAL riscv: add support for TIF_NOTIFY_SIGNAL nds32: add support for TIF_NOTIFY_SIGNAL ia64: add support for TIF_NOTIFY_SIGNAL h8300: add support for TIF_NOTIFY_SIGNAL c6x: add support for TIF_NOTIFY_SIGNAL alpha: add support for TIF_NOTIFY_SIGNAL xtensa: add support for TIF_NOTIFY_SIGNAL arm: add support for TIF_NOTIFY_SIGNAL microblaze: add support for TIF_NOTIFY_SIGNAL hexagon: add support for TIF_NOTIFY_SIGNAL csky: add support for TIF_NOTIFY_SIGNAL openrisc: add support for TIF_NOTIFY_SIGNAL sh: add support for TIF_NOTIFY_SIGNAL um: add support for TIF_NOTIFY_SIGNAL ...

axboe mentioned this issue Sep 30, 2020

io_uring is slower than epoll #189

Closed

isilence mentioned this issue Oct 2, 2020

native_queued_spin_lock_slowpath contention even after #215 fix #218

Closed

axboe closed this as completed Nov 24, 2020

jorangreef mentioned this issue May 26, 2021

Proposal: Event loop redesign ziglang/zig#8224

Open

isilence mentioned this issue Oct 18, 2021

Need some help to use liburing properly #458

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

performance degradation starting from 5.7.16 #215

performance degradation starting from 5.7.16 #215

romange commented Sep 27, 2020

axboe commented Sep 27, 2020

romange commented Sep 27, 2020

romange commented Sep 28, 2020

axboe commented Sep 28, 2020

romange commented Sep 28, 2020 via email

axboe commented Sep 28, 2020

romange commented Sep 28, 2020 via email

isilence commented Sep 28, 2020

axboe commented Sep 28, 2020

axboe commented Sep 28, 2020

axboe commented Sep 28, 2020

axboe commented Sep 28, 2020

romange commented Sep 28, 2020 •

edited

Loading

axboe commented Sep 28, 2020

romange commented Sep 29, 2020 •

edited

Loading

axboe commented Sep 30, 2020

romange commented Sep 30, 2020 via email

axboe commented Sep 30, 2020

axboe commented Sep 30, 2020

axboe commented Sep 30, 2020

romange commented Sep 30, 2020 via email

romange commented Oct 1, 2020 •

edited

Loading

axboe commented Oct 1, 2020

romange commented Oct 1, 2020

axboe commented Oct 1, 2020

axboe commented Nov 24, 2020

performance degradation starting from 5.7.16 #215

performance degradation starting from 5.7.16 #215

Comments

romange commented Sep 27, 2020

axboe commented Sep 27, 2020

romange commented Sep 27, 2020

romange commented Sep 28, 2020

axboe commented Sep 28, 2020

romange commented Sep 28, 2020 via email

axboe commented Sep 28, 2020

romange commented Sep 28, 2020 via email

isilence commented Sep 28, 2020

axboe commented Sep 28, 2020

axboe commented Sep 28, 2020

axboe commented Sep 28, 2020

axboe commented Sep 28, 2020

romange commented Sep 28, 2020 • edited Loading

axboe commented Sep 28, 2020

romange commented Sep 29, 2020 • edited Loading

axboe commented Sep 30, 2020

romange commented Sep 30, 2020 via email

axboe commented Sep 30, 2020

axboe commented Sep 30, 2020

axboe commented Sep 30, 2020

romange commented Sep 30, 2020 via email

romange commented Oct 1, 2020 • edited Loading

axboe commented Oct 1, 2020

romange commented Oct 1, 2020

axboe commented Oct 1, 2020

axboe commented Nov 24, 2020

romange commented Sep 28, 2020 •

edited

Loading

romange commented Sep 29, 2020 •

edited

Loading

romange commented Oct 1, 2020 •

edited

Loading