Too many threads with IOSQE_ASYNC and bad performance without... #420

Closed · johalun opened this issue Aug 27, 2021 · 74 comments
@johalun commented Aug 27, 2021

First, I don't want to use SQPOLL since that would have a CPU spinning at 100% all the time (our application is never idle, even at low util).

So I was having poor performance, and when profiling the app I see that io_uring_submit spends a lot of time in the kernel; around 30% of the main thread's time is spent there. Then I found IOSQE_ASYNC, which sounded great until I realized it spawns a thread for each request, meaning 10000+ threads getting created on the system. I was expecting IOSQE_ASYNC to offload the request to a fixed number of worker threads. Is there not such an option? If not, what are the alternatives for reducing the main thread's time spent in the kernel?
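For reference, a minimal sketch of how IOSQE_ASYNC is applied per request with liburing; the socket fd, buffer, and sizes here are illustrative:

	struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
	/* read from the TCP socket */
	io_uring_prep_recv(sqe, sockfd, buf, sizeof(buf), 0);
	/* force this request out to an io-wq worker instead of inline submit / poll retry */
	io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);
	io_uring_submit(&ring);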

Thanks!

@axboe (Owner) commented Aug 27, 2021

What kernel is this? Since you did profiling, care to share those profiles? I'm assuming it's the actual request type handler that ends up being expensive on the submission side, but I'd need to go a bit deeper than that to figure out what the real problem is here.

For IOSQE_ASYNC, it really depends on the request type. For requests with bounded execution time, the number of threads is limited. For unbounded ones (like network, poll, etc), there really isn't a limit. We could add something like that, but it might work better if you just define how many in-flight requests you want and then mark them appropriately in the app. That said, it should be very feasible to add an io_uring_register() type that can get/set the max number of unbounded threads. If you want to test, I can surely do that.

@johalun (Author) commented Aug 27, 2021

I've tested with 5.12.13 and 5.13.9. The request types are TCP socket and disk file I/O, including splice using intermediate pipes (my implementation of sendfile). Splice and disk I/O always get offloaded to a (single, I think) worker thread. Network I/O's io_write and io_read are what take up CPU time on the main thread when I'm not using IOSQE_ASYNC. On the flamegraph I see io_uring_submit call into io_read->tcp_recvmsg and io_write->tcp_sendmsg. I can send you a flamegraph svg privately if you want.
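For context, a minimal sketch of the splice-through-a-pipe sendfile pattern described above, using liburing; the descriptors, chunk size, and lack of error/short-transfer handling are purely illustrative:

	/* file -> pipe -> socket, as two linked splice requests */
	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_splice(sqe, file_fd, file_off, pipe_fds[1], -1, chunk, 0);
	io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_splice(sqe, pipe_fds[0], -1, sock_fd, -1, chunk, 0);

	io_uring_submit(&ring);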

@johalun (Author) commented Aug 27, 2021

Having an option to limit the number of threads would be great! Would that be so that each worker thread has its own work queue, or would new requests run synchronously when all worker threads are occupied?

@axboe (Owner) commented Aug 27, 2021

Something like this:

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 8da9bb103916..22b39923e000 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -1152,6 +1152,27 @@ int io_wq_cpu_affinity(struct io_wq *wq, cpumask_var_t mask)
 	return 0;
 }
 
+/*
+ * Set max number of unbounded workers, returns old value.
+ */
+int io_wq_max_unbound(struct io_wq *wq, int new_count)
+{
+	int i, prev = 0;
+
+	if (new_count > task_rlimit(current, RLIMIT_NPROC))
+		return -EINVAL;
+
+	rcu_read_lock();
+	for_each_node(i) {
+		struct io_wqe *wqe = wq->wqes[i];
+
+		prev = max_t(int, wqe->acct[IO_WQ_ACCT_UNBOUND].max_workers, prev);
+		wqe->acct[IO_WQ_ACCT_UNBOUND].max_workers = new_count;
+	}
+	rcu_read_unlock();
+	return prev;
+}
+
 static __init int io_wq_init(void)
 {
 	int ret;
diff --git a/fs/io-wq.h b/fs/io-wq.h
index 308af3928424..d6631c5fca11 100644
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -128,6 +128,7 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work);
 void io_wq_hash_work(struct io_wq_work *work, void *val);
 
 int io_wq_cpu_affinity(struct io_wq *wq, cpumask_var_t mask);
+int io_wq_max_unbound(struct io_wq *wq, int new_count);
 
 static inline bool io_wq_is_hashed(struct io_wq_work *work)
 {
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 53326449d685..8c4d9978b877 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -10233,6 +10233,22 @@ static int io_unregister_iowq_aff(struct io_ring_ctx *ctx)
 	return io_wq_cpu_affinity(tctx->io_wq, NULL);
 }
 
+static int io_register_iowq_max_unbound(struct io_ring_ctx *ctx,
+					void __user *arg)
+{
+	struct io_uring_task *tctx = current->io_uring;
+	__u32 new_count;
+
+	if (!tctx || !tctx->io_wq)
+		return -EINVAL;
+	if (copy_from_user(&new_count, arg, sizeof(new_count)))
+		return -EFAULT;
+	if (!new_count || new_count > INT_MAX)
+		return -EINVAL;
+
+	return io_wq_max_unbound(tctx->io_wq, new_count);
+}
+
 static bool io_register_op_must_quiesce(int op)
 {
 	switch (op) {
@@ -10250,6 +10266,7 @@ static bool io_register_op_must_quiesce(int op)
 	case IORING_REGISTER_BUFFERS_UPDATE:
 	case IORING_REGISTER_IOWQ_AFF:
 	case IORING_UNREGISTER_IOWQ_AFF:
+	case IORING_REGISTER_IOWQ_MAX_UNBOUND:
 		return false;
 	default:
 		return true;
@@ -10406,6 +10423,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			break;
 		ret = io_unregister_iowq_aff(ctx);
 		break;
+	case IORING_REGISTER_IOWQ_MAX_UNBOUND:
+		ret = -EINVAL;
+		if (!arg || nr_args != 1)
+			break;
+		ret = io_register_iowq_max_unbound(ctx, arg);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 45a4f2373694..bb6845e14629 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -309,6 +309,9 @@ enum {
 	IORING_REGISTER_IOWQ_AFF		= 17,
 	IORING_UNREGISTER_IOWQ_AFF		= 18,
 
+	/* set/get max number of unbounded workers */
+	IORING_REGISTER_IOWQ_MAX_UNBOUND	= 19,
+
 	/* this goes last */
 	IORING_REGISTER_LAST
 };

@axboe (Owner) commented Aug 27, 2021

Please do send the flamegraph, but it also sounds like a case of "the networking side is expensive"...

The way it works is that there's a queue per node for each ring, and if we queue work and there's no available worker, then we'll see if we need to create one. If we do, then a new one is created and it'll handle that new work. If we cannot create a new worker (like would happen if you set the count lower with the above patch), then the newly added work would not get executed until a previous worker of the same type has finished what it is doing.

@axboe (Owner) commented Aug 27, 2021

I'll test the posted patch and send it out for review. Would be great if you could tell me what your email is; I like to put Reported-by etc tags in the actual commit to help show who requested/reviewed/etc what.

@johalun (Author) commented Aug 27, 2021

johalun0@gmail.com

@axboe (Owner) commented Aug 27, 2021

Tested the patch and posted it.

@johalun (Author) commented Aug 27, 2021

Thanks! I'll test it too today or over the weekend. By the way, is the limit set by IORING_REGISTER_IOWQ_MAX_UNBOUND global or per process?

@axboe (Owner) commented Aug 27, 2021

The limit is per-ring

@axboe (Owner) commented Aug 27, 2021

Actually, to be more specific, it's per node per ring. So if you have, say, a dual-socket system with 2 NUMA nodes, then the limit is effectively doubled if you keep both nodes loaded on that particular ring. Hope that makes sense...

@johalun (Author) commented Aug 27, 2021

Makes sense, thanks!

ammarfaizi2 pushed a commit to ammarfaizi2/linux-block that referenced this issue Aug 28, 2021
io-wq divides work into two categories:

1) Work that completes in a bounded time, like reading from a regular file
   or a block device. This type of work is limited based on the size of
   the SQ ring.

2) Work that may never complete, we call this unbounded work. The amount
   of workers here is just limited by RLIMIT_NPROC.

For various use cases, it's handy to have the kernel limit the maximum
amount of pending unbounded workers. Provide a way to do this with a new
IORING_REGISTER_IOWQ_MAX_UNBOUND operation.

IORING_REGISTER_IOWQ_MAX_UNBOUND takes an integer and sets the max worker
count to what is being passed in, and returns the old value (or an error).
If 0 is being passed in, it simply returns the current value.

The value is capped at RLIMIT_NPROC. This actually isn't that important
as it's more of a hint, if we're exceeding the value then our attempt
to fork a new worker will fail. This happens naturally already if more
than one node is in the system, as these values are per-node internally
for io-wq.

Reported-by: Johannes Lundberg <johalun0@gmail.com>
Link: axboe/liburing#420
Signed-off-by: Jens Axboe <axboe@kernel.dk>
@johalun (Author) commented Aug 29, 2021

I applied the patch to 5.14-rc7 and built the kernel. The io_uring_register command gives me correct before and after values; however, my application still creates new threads for each submission, resulting in hundreds or thousands of threads being created.

@axboe (Owner) commented Aug 29, 2021

Just so we're clear, this is only for the unbounded workers. The bounded workers are still capped by the SQ ring size, in detail:

        /* Do QD, or 4 * CPUS, whatever is smallest */
        concurrency = min(ctx->sq_entries, 4 * num_online_cpus());
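(For instance, with a 16K-entry SQ ring on a 24-CPU system, that works out to min(16384, 4 * 24) = 96 bounded workers.)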

I need to double check; we had a few worker-creation fixes recently, but they should be in -rc7. Just for the sake of completeness, can you try and pull:

git://git.kernel.dk/linux-block for-5.15/io_uring

into a v5.14-rc7 branch on your end and test with that? If it still fails to properly cap your situation, I'll take a closer look. The basic test case I wrote worked just fine, but there could be differences in what is being queued.

@axboe (Owner) commented Aug 29, 2021

My silly test case:

/* SPDX-License-Identifier: MIT */
/*
 * Description: test max unbounded workers
 *
 */
#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <sys/types.h>
#include <dirent.h>

#include "liburing.h"
#include "../src/syscall.h"

static int count_threads(void)
{
	struct dirent *entry;
	DIR *pdir;
	int ret;

	pdir = opendir("/proc/self/task");
	if (!pdir)
		 return -1;

	ret = 0;
	while ((entry = readdir(pdir)) != NULL) {
		if (entry->d_name[0] == '.')
			continue;
		ret++;
	}
	closedir(pdir);
	return ret - 1;
}

static int submit_read(struct io_uring *ring, int fd)
{
	struct io_uring_sqe *sqe;
	void *buf;
	int ret;

	buf = malloc(8);

	sqe = io_uring_get_sqe(ring);
	if (!sqe) {
		fprintf(stderr, "get sqe failed\n");
		goto err;
	}

	io_uring_prep_read(sqe, fd, buf, 8, 0);

	ret = io_uring_submit(ring);
	if (ret <= 0) {
		fprintf(stderr, "sqe submit failed: %d\n", ret);
		goto err;
	}

	return 0;
err:
	return 1;
}

int main(int argc, char *argv[])
{
	struct io_uring ring;
	int ret, fds[2], i;
	int max_workers, input;

	if (argc != 2) {
		printf("%s: <max workers>\n", argv[0]);
		return 1;
	}
	input = atoi(argv[1]);

	if (pipe(fds) < 0) {
		perror("pipe");
		return 1;
	}

	ret = io_uring_queue_init(64, &ring, 0);
	if (ret) {
		fprintf(stderr, "ring setup failed: %d\n", ret);
		return 1;
	}

	max_workers = __sys_io_uring_register(ring.ring_fd, IORING_REGISTER_IOWQ_MAX_UNBOUND, &input, 1);
	if (max_workers < 0) {
		perror("register");
		return 1;
	}
	printf("max_workers was %d, now %d\n", max_workers, input);

	for (i = 0; i < 1024; i++)
		submit_read(&ring, fds[0]);

	sleep(1);

	printf("Threads: %d\n", count_threads());
	return 0;
}

which basically just queues 1024 reads that will go async. Since they are on a pipe, execution time is unbounded. When the reads have been queued, we check how many worker threads are running. Sample run:

[root@archlinux liburing]# test/max_workers 2
max_workers was 7923, now 2
Threads: 2
[root@archlinux liburing]# test/max_workers 128
max_workers was 7923, now 128
Threads: 128
[root@archlinux liburing]# test/max_workers 16
max_workers was 7923, now 16
Threads: 16

@axboe (Owner) commented Aug 29, 2021

If this is a bounded issue as well, we can modify the patch to allow setting both of them. Here's an updated version you can apply to 5.14-rc7 and test. Note that this one requires you to pass in an int max_workers[2] array instead, where index 0 is the bounded count and index 1 is the unbounded count. The return value is now always 0/-1, and the previous counts are copied back into the array on success instead.

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 7d2ed8c7dd31..93a360938561 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -1143,6 +1143,35 @@ int io_wq_cpu_affinity(struct io_wq *wq, cpumask_var_t mask)
 	return 0;
 }
 
+/*
+ * Set max number of unbounded workers, returns old value. If new_count is 0,
+ * then just return the old value.
+ */
+int io_wq_max_workers(struct io_wq *wq, int *new_count)
+{
+	int i, node, prev = 0;
+
+	for (i = 0; i < 2; i++) {
+		if (new_count[i] > task_rlimit(current, RLIMIT_NPROC))
+			new_count[i] = task_rlimit(current, RLIMIT_NPROC);
+	}
+
+	rcu_read_lock();
+	for_each_node(node) {
+		struct io_wqe_acct *acct;
+
+		for (i = 0; i < 2; i++) {
+			acct = &wq->wqes[node]->acct[i];
+			prev = max_t(int, acct->max_workers, prev);
+			if (new_count[i])
+				acct->max_workers = new_count[i];
+			new_count[i] = prev;
+		}
+	}
+	rcu_read_unlock();
+	return 0;
+}
+
 static __init int io_wq_init(void)
 {
 	int ret;
diff --git a/fs/io-wq.h b/fs/io-wq.h
index 3999ee58ff26..8284d3b7ca75 100644
--- a/fs/io-wq.h
+++ b/fs/io-wq.h
@@ -128,6 +128,7 @@ void io_wq_enqueue(struct io_wq *wq, struct io_wq_work *work);
 void io_wq_hash_work(struct io_wq_work *work, void *val);
 
 int io_wq_cpu_affinity(struct io_wq *wq, cpumask_var_t mask);
+int io_wq_max_workers(struct io_wq *wq, int *new_count);
 
 static inline bool io_wq_is_hashed(struct io_wq_work *work)
 {
diff --git a/fs/io_uring.c b/fs/io_uring.c
index a2e20a6fbfed..0dbf2e42c15d 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -10048,6 +10048,31 @@ static int io_unregister_iowq_aff(struct io_ring_ctx *ctx)
 	return io_wq_cpu_affinity(tctx->io_wq, NULL);
 }
 
+static int io_register_iowq_max_workers(struct io_ring_ctx *ctx,
+					void __user *arg)
+{
+	struct io_uring_task *tctx = current->io_uring;
+	__u32 new_count[2];
+	int i, ret;
+
+	if (!tctx || !tctx->io_wq)
+		return -EINVAL;
+	if (copy_from_user(new_count, arg, sizeof(new_count)))
+		return -EFAULT;
+	for (i = 0; i < ARRAY_SIZE(new_count); i++)
+		if (new_count[i] > INT_MAX)
+			return -EINVAL;
+
+	ret = io_wq_max_workers(tctx->io_wq, new_count);
+	if (ret)
+		return ret;
+
+	if (copy_to_user(arg, new_count, sizeof(new_count)))
+		return -EFAULT;
+
+	return 0;
+}
+
 static bool io_register_op_must_quiesce(int op)
 {
 	switch (op) {
@@ -10065,6 +10090,7 @@ static bool io_register_op_must_quiesce(int op)
 	case IORING_REGISTER_BUFFERS_UPDATE:
 	case IORING_REGISTER_IOWQ_AFF:
 	case IORING_UNREGISTER_IOWQ_AFF:
+	case IORING_REGISTER_IOWQ_MAX_WORKERS:
 		return false;
 	default:
 		return true;
@@ -10216,6 +10242,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			break;
 		ret = io_unregister_iowq_aff(ctx);
 		break;
+	case IORING_REGISTER_IOWQ_MAX_WORKERS:
+		ret = -EINVAL;
+		if (!arg || nr_args != 2)
+			break;
+		ret = io_register_iowq_max_workers(ctx, arg);
+		break;
 	default:
 		ret = -EINVAL;
 		break;
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 79126d5cd289..bf79e0b53e19 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -306,6 +306,9 @@ enum {
 	IORING_REGISTER_IOWQ_AFF		= 17,
 	IORING_UNREGISTER_IOWQ_AFF		= 18,
 
+	/* set/get max number of workers */
+	IORING_REGISTER_IOWQ_MAX_WORKERS	= 19,
+
 	/* this goes last */
 	IORING_REGISTER_LAST
 };

@johalun (Author) commented Aug 29, 2021

Your test case works correctly for me as well. But when I try setting the IOSQE_ASYNC flag only for reading from a TCP socket, several hundred threads are still created. I will try the updated patch. IIUC, network reads should be unbounded, correct?

@johalun (Author) commented Aug 29, 2021

By the way, this is a 24-CPU system, so bounded should be limited to 96 threads IIUC (my queue depth is 16K). Counting the threads during heavy load, this seems to be correct.

@axboe (Owner) commented Aug 29, 2021

Changed the test case to use a TCP socket, even though it really shouldn't make a difference. And it does the right thing there for me too. If I don't set IOSQE_ASYNC, then they are all queued in polled mode, which doesn't create any workers (that's expected). If I set IOSQE_ASYNC, then it limits the workers as instructed:

[root@archlinux liburing]# test/max_workers 1024 4
connect fd 5
Setting max workers 1024 4
Max workers was 16 7923
Threads: 4
[root@archlinux liburing]# test/max_workers 1024 32
connect fd 5
Setting max workers 1024 32
Max workers was 16 7923
Threads: 32

Puzzled...

@johalun (Author) commented Aug 29, 2021

I merged the 5.15 io_uring branch into a clean 5.14-rc7, but still the same. Even limiting bounded and unbounded didn't change anything. I only have IOSQE_ASYNC on for TCP socket reads, but still.. :( By the way, I'm using io_uring_prep_XXX for all reads/writes etc.

@johalun (Author) commented Aug 29, 2021

I limited bounded and unbounded to 2 and am still seeing over 60 iou-wrk threads without using IOSQE_ASYNC at all.

@johalun (Author) commented Aug 29, 2021

The weird thing is, your test works. The threads never exceed the specified max value.

@johalun (Author) commented Aug 29, 2021

I saw that in your test you never set sqe->flags = IOSQE_ASYNC. I added that to the test and it still works as expected, while my program does not.. Hmm..

@axboe (Owner) commented Aug 29, 2021

IOSQE_ASYNC isn't needed for the pipe test, just for the socket one. But yeah, either one works for me, pipes or sockets. And they really should, since there's no difference at that level.

Hence I'm a bit puzzled about what is going on at your end. Are you sure you're setting it on the right ring, if you have more than one? Maybe describe your setup a bit; it might help me understand why it behaves differently.

@axboe (Owner) commented Aug 29, 2021

Let me generate a small incremental patch that just adds some tracing to this.

@johalun (Author) commented Aug 29, 2021

It's just a single ring, single thread, with a couple of listeners accepting incoming TCP connections. I have wrk running as a client on another host, spamming HTTP requests.
Init is simple:

--SNIP--
/* Temporary until we got libc or liburing support */
# ifndef __NR_io_uring_register
#  define __NR_io_uring_register	427
# endif
#include <sys/syscall.h>
int __sys_io_uring_register(int fd, unsigned opcode, const void *arg,
			    unsigned nr_args)
{
	return syscall(__NR_io_uring_register, fd, opcode, arg, nr_args);
}

--SNIP
    int ret;
    ret = io_uring_queue_init(QUEUE_DEPTH, &ring, 0);
    if (ret < 0) {
	debug(150, 0)("%s: queue init failed: %s\n", __func__, strerror(-ret));
	exit(-1);
    }

#define IORING_REGISTER_IOWQ_MAX 19 /* Available in Linux 5.15 */
    int worker_limit[2];
    worker_limit[0] = 2; 
    worker_limit[1] = 2;
    ret = __sys_io_uring_register(ring.ring_fd, IORING_REGISTER_IOWQ_MAX, worker_limit, 2);
    if (ret < 0)
	debug(0, 0) ("%s: ERROR: %s\n", __func__, strerror(errno));
    debug(0, 0) ("%s: Max workers was %d, %d\n", __func__, worker_limit[0], worker_limit[1]);
    worker_limit[0] = 0;
    worker_limit[1] = 0;
    ret = __sys_io_uring_register(ring.ring_fd, IORING_REGISTER_IOWQ_MAX, worker_limit, 2);
    if (ret < 0)
	debug(0, 0) ("%s: ERROR: %s\n", __func__, strerror(errno));
    debug(0, 0) ("%s: Max workers is now %d, %d\n", __func__, worker_limit[0], worker_limit[1]);
--SNIP--

I was playing around with ulimit -u, and it does respect that limit, but not the values I set.
After accepting a new connection, I do io_uring_prep_read(...) with the IOSQE_ASYNC flag, and then a thread gets created for each read submission.
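As an aside, newer liburing releases wrap this register operation, so the raw-syscall shim above should no longer be necessary; something along these lines, with the limits chosen purely for illustration:

	unsigned int limits[2] = { 24, 24 };	/* [0] = bounded, [1] = unbounded */
	ret = io_uring_register_iowq_max_workers(&ring, limits);
	/* on success, limits[] now holds the previous values */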

@axboe (Owner) commented Aug 29, 2021

Try this one:

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 4b5fc621ab39..5aa5dd68a63c 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -17,6 +17,9 @@
 
 #include "io-wq.h"
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/io_uring.h>
+
 #define WORKER_IDLE_TIMEOUT	(5 * HZ)
 
 enum {
@@ -262,6 +265,7 @@ static void io_wqe_wake_worker(struct io_wqe *wqe, struct io_wqe_acct *acct)
 			do_create = true;
 		}
 		raw_spin_unlock_irq(&wqe->lock);
+		trace_io_uring_wq(0, do_create, first, acct->nr_workers, acct->max_workers);
 		if (do_create) {
 			atomic_inc(&acct->nr_running);
 			atomic_inc(&wqe->wq->worker_refs);
@@ -297,6 +301,7 @@ static void create_worker_cb(struct callback_head *cb)
 		do_create = true;
 	}
 	raw_spin_unlock_irq(&wqe->lock);
+	trace_io_uring_wq(1, do_create, first, acct->nr_workers, acct->max_workers);
 	if (do_create) {
 		create_io_worker(wq, wqe, worker->create_index, first);
 	} else {
diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h
index e4e44a2b4aa9..7291e231be5a 100644
--- a/include/trace/events/io_uring.h
+++ b/include/trace/events/io_uring.h
@@ -497,6 +497,33 @@ TRACE_EVENT(io_uring_task_run,
 		  (unsigned long long) __entry->user_data)
 );
 
+TRACE_EVENT(io_uring_wq,
+
+	TP_PROTO(int nr, int do_create, int first, int nr_w, int max_w),
+
+	TP_ARGS(nr, do_create, first, nr_w, max_w),
+
+	TP_STRUCT__entry (
+		__field(  int,		nr		)
+		__field(  int,		do_create	)
+		__field(  int,		first		)
+		__field(  int,		nr_w		)
+		__field(  int,		max_w		)
+	),
+
+	TP_fast_assign(
+		__entry->nr		= nr;
+		__entry->do_create	= do_create;
+		__entry->first		= first;
+		__entry->nr_w		= nr_w;
+		__entry->max_w		= max_w;
+	),
+
+	TP_printk("nr %d, do_create %d, first %d, nr_w %d, max_w %d\n",
+		  __entry->nr, __entry->do_create, __entry->first,
+		  __entry->nr_w, __entry->max_w)
+);
+
 #endif /* _TRACE_IO_URING_H */
 
 /* This part must be outside protection */

And then after booting and before running your test do:

echo 1 > /sys/kernel/debug/tracing/events/io_uring/io_uring_wq/enable

Run the test that shows too many workers, then attach a compressed copy of /sys/kernel/debug/tracing/trace in here.

@axboe (Owner) commented Aug 29, 2021

Not sure what happened to the formatting there... Anyway, if you run the above trace addition, hopefully it'll shed some light on why it doesn't work.

@axboe (Owner) commented Aug 29, 2021

Patch that actually links...:

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 4b5fc621ab39..549f5a57192a 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -17,6 +17,8 @@
 
 #include "io-wq.h"
 
+#include <trace/events/io_uring.h>
+
 #define WORKER_IDLE_TIMEOUT	(5 * HZ)
 
 enum {
@@ -262,6 +264,7 @@ static void io_wqe_wake_worker(struct io_wqe *wqe, struct io_wqe_acct *acct)
 			do_create = true;
 		}
 		raw_spin_unlock_irq(&wqe->lock);
+		trace_io_uring_wq(0, do_create, first, acct->nr_workers, acct->max_workers);
 		if (do_create) {
 			atomic_inc(&acct->nr_running);
 			atomic_inc(&wqe->wq->worker_refs);
@@ -297,6 +300,7 @@ static void create_worker_cb(struct callback_head *cb)
 		do_create = true;
 	}
 	raw_spin_unlock_irq(&wqe->lock);
+	trace_io_uring_wq(1, do_create, first, acct->nr_workers, acct->max_workers);
 	if (do_create) {
 		create_io_worker(wq, wqe, worker->create_index, first);
 	} else {
diff --git a/include/trace/events/io_uring.h b/include/trace/events/io_uring.h
index e4e44a2b4aa9..ae56d6b2dd3c 100644
--- a/include/trace/events/io_uring.h
+++ b/include/trace/events/io_uring.h
@@ -497,6 +497,33 @@ TRACE_EVENT(io_uring_task_run,
 		  (unsigned long long) __entry->user_data)
 );
 
+TRACE_EVENT(io_uring_wq,
+
+	TP_PROTO(int nr, int do_create, int first, int nr_w, int max_w),
+
+	TP_ARGS(nr, do_create, first, nr_w, max_w),
+
+	TP_STRUCT__entry (
+		__field(  int,		nr		)
+		__field(  int,		do_create	)
+		__field(  int,		first		)
+		__field(  int,		nr_w		)
+		__field(  int,		max_w		)
+	),
+
+	TP_fast_assign(
+		__entry->nr		= nr;
+		__entry->do_create	= do_create;
+		__entry->first		= first;
+		__entry->nr_w		= nr_w;
+		__entry->max_w		= max_w;
+	),
+
+	TP_printk("nr %d, do_create %d, first %d, nr_w %d, max_w %d",
+		  __entry->nr, __entry->do_create, __entry->first,
+		  __entry->nr_w, __entry->max_w)
+);
+
 #endif /* _TRACE_IO_URING_H */
 
 /* This part must be outside protection */

@axboe (Owner) commented Aug 30, 2021

At least that's something! Can you add this one on top as well? Curious if this is hashing related or not.

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 7cc458e0b636..1c7c8a623107 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1381,7 +1381,7 @@ static void io_prep_async_work(struct io_kiocb *req)
 
 	if (req->flags & REQ_F_ISREG) {
 		if (def->hash_reg_file || (ctx->flags & IORING_SETUP_IOPOLL))
-			io_wq_hash_work(&req->work, file_inode(req->file));
+			; // io_wq_hash_work(&req->work, file_inode(req->file));
 	} else if (!req->file || !S_ISBLK(file_inode(req->file)->i_mode)) {
 		if (def->unbound_nonreg_file)
 			req->work.flags |= IO_WQ_WORK_UNBOUND;

@johalun (Author) commented Aug 31, 2021

No change

@axboe (Owner) commented Aug 31, 2021

Ok thanks for testing, that's a helpful datapoint

@axboe (Owner) commented Aug 31, 2021

normally be at 100% cpu util serving at least 1500 req/s, it's instead at 30%, serving 300 req/s.

Is this compared to not using IOSQE_ASYNC? Can you run a perf record -a -g -- sleep 3 while it's doing those 300 req/s and attach the perf report -g --no-children output here afterwards?

@johalun (Author) commented Aug 31, 2021

Is this compared to not using IOSQE_ASYNC?

Yes, everything is the same except enabling/disabling IOSQE_ASYNC for TCP socket reads only. I emailed the perf data to you.

@axboe (Owner) commented Aug 31, 2021

Can you give the current for-5.15/io_uring branch a go?

I looked at your profile, and nothing really looks too busy. When you run with IOSQE_ASYNC, how many workers are active? Maybe try and do a perf record -g -p <pid of worker> -- sleep 3 while it's running and see what that looks like.

There's another potential solution here. I'm not particularly a big fan of IOSQE_ASYNC. It's fine for some cases, but it tends to be a bit inefficient in various ways (for pollable file types). Hence I've been thinking about a hybrid kind of model. Right now, these are the two outcomes of queueing a request against a socket:

  1. IOSQE_ASYNC isn't set. The request is attempted, and if no data/space is available, internal poll is armed. When we get the trigger event, the request is queued for task_work based retry. We expect that to succeed, and hence this completes the request. This path is pretty efficient. It does mean that the task that originally submitted the request will ultimately end up re-doing it, and that can of course be a bottleneck.

  2. IOSQE_ASYNC is set. The request is sent directly to an iou-wrk worker immediately when prepared. It'll do a blocking attempt, and hence if no data/space is available, it'll simply sleep on that condition. When woken, it completes the request. This is inefficient in that latency is added going to the worker, but also because you can end up with lots of iou-wrk threads simply sleeping on requests. When they are sleeping on a request, they cannot handle other work, obviously.

The hybrid model would combine the two. Let's call it IOSQE_ASYNC_HYBRID. If that is set, then we still send it directly to an iou-wrk worker. However, that worker will go through the same motions as the non-IOSQE_ASYNC path, and arm poll if we cannot read/write the socket. This leaves the worker free to perform other requests, as the async trigger will ensure it runs the work immediately when we get the notification.

An alternative approach is having IOSQE_ASYNC_HYBRID go through the normal submit attempt first, in the hope that data/space is available. If not, then it sends to iou-wrk and it arms poll. Slight twist on the above.

This would work for unbounded work that can get interrupted, it would not work for bounded work. But bounded work isn't pollable anyway, so the existing IOSQE_ASYNC is the right model for that.
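To make that concrete: under the hybrid scheme, a recv handed to io-wq that finds the socket empty would arm poll from the worker and hand that worker back for other requests, rather than parking it in tcp_recvmsg until data arrives as in case 2 above.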

@johalun (Author) commented Aug 31, 2021

I will give it a try. Yeah, that's what I'm seeing too. Nothing is really busy on the system, but requests per second are just way down. Some connections even time out (with a 10s connection time limit set by wrk).

I've been running with a worker limit of 24 bounded and 24 unbounded for all tests recently. I run ps -T -p pid while the test is running, and I see the number of threads go from 3 (at start of the program) to 51 consistently when the system gets busy (with or without IOSQE_ASYNC it spawns 48 workers).

@johalun (Author) commented Aug 31, 2021

Can you give the current for-5.15/io_uring branch a go?

Back to the worst case again :( Correction regarding the last comment: with IOSQE_ASYNC, workers do come and go during the test. Without it, the number of workers is consistently maxed out.

I didn't really see anything of interest by limiting perf to a worker pid.

@axboe (Owner) commented Aug 31, 2021

What sha did you run? I noticed a stupid last minute error made it in there... This should be the right HEAD:

commit 0a1d9c940ee44c9949bee2c5ae01e4abfa452695 (origin/for-5.15/io_uring, for-5.15/io_uring)
Author: Jens Axboe <axboe@kernel.dk>
Date:   Mon Aug 30 14:17:12 2021 -0600

    io-wq: split bounded and unbounded work into separate lists

@johalun (Author) commented Aug 31, 2021

That's the one. Maybe it's more or less like the other day. It kind of depends on how much is cached, and for cached objects, sendfile (splice) is being used. I cleared the cache and ran a few times again and got similar results to yesterday.

@johalun (Author) commented Aug 31, 2021

Hmm, found something interesting! So far I have only enabled IOSQE_ASYNC for TCP socket reads. This time I tried enabling it only for TCP socket writes. Now performance is great, with none of the issues seen with async reads. I can also see in my flamegraph that io_write (for TCP) is no longer called from the main thread's call to io_uring_submit but instead from the worker threads.

@johalun (Author) commented Aug 31, 2021

Ok, for forced async writes it looks like it is still waiting a little bit in io_uring_submit when it really shouldn't (there's lots of work to do, but it doesn't do it, so requests/sec gets lower). Still way better and more stable than when forcing async for reads.

@johalun (Author) commented Aug 31, 2021

When forcing async for connect, I'm getting connect timeouts in my application, even at very low load from the client, but never for individual requests.

@johalun (Author) commented Aug 31, 2021

I tried with SQPOLL, and it seems the worker limit does not apply in that case; I'm getting 1000+ threads again. Edit: without using IOSQE_ASYNC at all, I'm still getting 250+ threads during load with SQPOLL.

@axboe (Owner) commented Aug 31, 2021

I'll get to the rest tomorrow, but yeah, we do need to handle it a bit differently for SQPOLL; it currently doesn't update the worker values in the right task context. I'll queue up a patch for that and let you know, so you can test it. Thanks!

@axboe (Owner) commented Sep 1, 2021

Can you try with this applied?

https://git.kernel.dk/cgit/linux-block/commit/?h=for-5.15/io_uring&id=0c5ffe4564fee58e1da6f61e93a49ba21b010c30

Totally untested...

ammarfaizi2 pushed a commit to ammarfaizi2/linux-block that referenced this issue Sep 1, 2021
SQPOLL has a different thread doing submissions, we need to check for
that and use the right task context when updating the worker values.
Just hold the sqd->lock across the operation, this ensures that the
thread cannot go away while we poke at ->io_uring.

Link: axboe/liburing#420
Reported-by: Johannes Lundberg <johalun0@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
@johalun (Author) commented Sep 1, 2021

Using SQPOLL now respects the worker limit 👍🏻

@axboe (Owner) commented Sep 1, 2021

Perfect, thanks for testing!

The same commit was subsequently picked up again by ammarfaizi2/linux-block (Sep 1-2, 2021) and by 0day-ci/linux via fengguang (Sep 2-3, 2021), with Tested-by: Johannes Lundberg <johalun0@gmail.com> and, later, Fixes: 2e48005 ("io-wq: provide a way to limit max number of workers") tags added.
@johalun (Author) commented Sep 7, 2021

Hey! Have you had time to work on this any more (the async read/write slowness issue), or on the hybrid model you mentioned earlier?

@axboe (Owner) commented Sep 7, 2021

I haven't yet had time to dive into that. I think the model is working much better with the most recent patches, though. I'm still interested in exploring the hybrid model, but I need to spend some time on that.

We should probably close out this issue; I don't like having multiple or never-ending issues when things get fixed and new things crop up. Would be great if you could open a new one for what comes next.
