
5.15-rc6 vs 5.10: too many iou-wrk-# spawned; iou-sqp-# hangs; linked timeout returns -EINVAL #460

Closed
beldzhang opened this issue Oct 18, 2021 · 40 comments


@beldzhang

We are converting our storage protocol to io_uring. On kernel 5.10 it worked perfectly, but when we try to run it on kernel 5.15 there are multiple issues. I'm not sure they are related, so I put them together.

The protocol has a typical structure:
loop:

  1. wait header
  2. handle requests: read-and-send or recv-and-write

We use liburing 2.1 and compared kernel 5.10 with 5.15-rc6:

issue 1:
when SQPOLL is not set, lots of iou-wrk-#### are spawned (>100) once there are >= 2 connections.
I think this should be limited by io_uring_register_iowq_max_workers().
It happens when the disk read/write goes through splice; with buffered I/O it is fine (a sketch of the splice pattern is shown below, after the list).

issue 2:
when SQPOLL is set, after the program exits, iou-sqp-#### hangs in state D (disk sleep).
The program cannot be killed; only a system reboot clears it.

issue 3:
every network recv/send is linked with a timeout SQE; sometimes the SQE linked to the header recv returns -EINVAL in cqe.res. This never occurred on kernel 5.10.
For now I just treat it like ECANCELED and it looks OK.

and other minor issues:
a) io_uring_submit() sometimes returns 0; is this an error? We just treat it as success and everything looks fine.

b) when a timeout is set in io_uring_wait_cqe_timeout() or io_uring_wait_cqes(), these calls cannot be interrupted by a signal (SIGINT or SIGTERM). This was first seen with liburing 0.7.
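A minimal sketch of the splice pattern from issue 1, in C for illustration only (ring, disk_fd, sock_fd, offset and len are placeholder names, not our actual code): the disk read is spliced into a pipe and the pipe is spliced into the socket, with the two SQEs linked so they run in order.

#include <liburing.h>
#include <unistd.h>

/* sketch only: read-and-send via splice, disk -> pipe -> socket */
static int splice_read_and_send(struct io_uring *ring, int disk_fd,
                                int sock_fd, int64_t offset, unsigned int len)
{
    int pipefd[2];
    struct io_uring_sqe *sqe;

    if (pipe(pipefd) < 0)
        return -1;

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_splice(sqe, disk_fd, offset, pipefd[1], -1, len, 0);
    sqe->flags |= IOSQE_IO_LINK;           /* run the second splice after this one */

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_splice(sqe, pipefd[0], -1, sock_fd, -1, len, 0);

    /* the pipe fds have to be closed once both CQEs complete */
    return io_uring_submit(ring);
}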

More details can be provided if required, and any suggestions or code to test are welcome.

Thanks a lot for your hard work.

@isilence
Collaborator

@beldzhang, let's try to take them one by one, but first I'll ask

  1. Check dmesg and paste it if anything interesting shows up (errors / warnings)
  2. Are they easily reproducible? Any way for me to run it, so I can reproduce myself?

@isilence
Collaborator

and other minor issues: a) io_uring_submit() sometimes returns 0; is this an error? We just treat it as success and everything looks fine.

The problem might be in __io_uring_submit() not checking submitted; let's try it out.

@isilence
Collaborator

issue 1: when SQPOLL is not set, lots of iou-wrk-#### are spawned (>100) once there are >= 2 connections. I think this should be limited by io_uring_register_iowq_max_workers(). It happens when the disk read/write goes through splice; with buffered I/O it is fine.

What is your rlimit value for RLIMIT_NPROC?

@isilence
Collaborator

issue 2: when SQPOLL is set, after the program exits, iou-sqp-#### hangs in state D (disk sleep). The program cannot be killed; only a system reboot clears it.

dmesg output may be useful. Also, can you get us a stack trace of that hung iou-sqp task?

cat /proc/<task-pid>/stack

You may need to run it several times to get anything meaningful if it's a livelock.

@isilence
Collaborator

issue 3: every network recv/send is linked with a timeout SQE; sometimes the SQE linked to the header recv returns -EINVAL in cqe.res. This never occurred on kernel 5.10. For now I just treat it like ECANCELED and it looks OK.

Is it SQPOLL? Do you share SQPOLL backends between multiple io_uring instances (i.e. IORING_SETUP_ATTACH_WQ)?
My guess is that either an IORING_OP_LINK_TIMEOUT was submitted without being linked to anything, e.g. because of some SQPOLL restrictions, or you have an IORING_OP_LINK_TIMEOUT linked to another IORING_OP_LINK_TIMEOUT.
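For reference, the expected pattern is a single IORING_OP_LINK_TIMEOUT immediately following the request it guards; a minimal sketch (sock, buf, len, ts and the user_data tags are illustrative):

#include <liburing.h>

/* sketch only: recv linked to a timeout; the timeout SQE itself must not
   carry IOSQE_IO_LINK, or it would chain into the next request */
static int recv_with_timeout(struct io_uring *ring, int sock, void *buf,
                             size_t len, struct __kernel_timespec *ts)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_recv(sqe, sock, buf, len, 0);
    sqe->flags |= IOSQE_IO_LINK;     /* the next SQE is the linked timeout */
    sqe->user_data = 1;              /* illustrative tag for the recv */

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_link_timeout(sqe, ts, 0);
    sqe->user_data = 2;              /* illustrative tag for the timeout */

    return io_uring_submit(ring);
}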

Will you be able to test kernel patches?

@beldzhang
Author

some details:
a. only one ring serves all connections
b. the ring was initialized with entries = 512
c. io_.._max_workers was queried and capped to (4,4) if greater

issue 1)
RLIMIT_NPROC = 15626
io_..._max_workers returned (1,4)
sorry, I remembered the iou-wrk count wrong; here are some of the highest counts I got:

conn: iou-wrk
  1 : 1
  2 : 63
  5 : 217
 10 : 298

issue 2) related to issue 3: the hang happens after the invalid timeout occurs
dmesg: outputs nothing
stack trace: (taken 3 times, all the same)
sh(1006)---uring(4870)---{uring}(4872)

	# cat /proc/4872/stack 
	[<0>] io_uring_cancel_generic+0x22a/0x2c0
	[<0>] io_sq_thread+0x3d0/0x5a0
	[<0>] ret_from_fork+0x1f/0x30

issue 3)
it always happens, with or without SQPOLL.

1:    [  0.000534] S <3e144100> SKT_RECV_BUFFER..[0]: link=<3e14c200>, len=....32 
 :    [  0.000535]   <3e14c200> res=..-125 
1:    [  0.000594]   <3e144100> res=....32 
2:    [  0.000597] S <3e144100> SKT_SEND_BUFFER..[0]: link=<3e14c200>, len=..4128 
1:    [  0.000642]   <3e14c200> res=..-125 
2:    [  0.000642]   <3e144100> res=..4128 
3:    [  0.000643] S <3e144100> SKT_RECV_BUFFER..[0]: link=<3e14c200>, len=....32 
2:    [  0.000676]   <3e14c200> res=..-125 
3:    [  0.000676]   <3e144100> res=....32 
 :    [  0.000678] S <3e144100> SKT_SEND_BUFFER..[0]: link=<3e14c200>, len=..4128 
3: !! [  0.000687]   <3e14c200> res=...-22 
 :    [  0.000792]   <3e144100> res=..4128 

this is a dump of SQE/CQE sequences.
1, 2, 3 is the order of the SQE and CQE {request, timeout} pairs; they overlap, and I'm not sure that follows the spec.
in #1 and #2 the timeout CQE res was right, but in #3 it returned -22 (EINVAL).
I'm using a single object (containing the timespec) for this, so at a given moment there are 2 timeout SQEs carrying the same user_data and timespec record. I then tested with separate objects, but the behavior was the same.


Checked dmesg every time, nothing related to this.
I have a testing server and client (Windows version) if you are interested.
I can test kernel patches (though I know very little about git; I use svn daily ...)

@isilence
Collaborator

isilence commented Oct 19, 2021

Thanks for helping with that,

some details: a. only one ring serves all connections b. the ring was initialized with entries = 512 c. io_.._max_workers was queried and capped to (4,4) if greater

issue 1) RLIMIT_NPROC = 15626 io_..._max_workers returned (1,4) sorry, I remembered the iou-wrk count wrong; here are some of the highest counts I got:

To be clear on this one: when you got that many iou-wrk workers, were you using io_uring_register_iowq_max_workers(4,4)?

It's expected for splice to be punted to io-wq and to create that many tasks; surely not ideal, but that's how it is for now. However, if io_uring_register_iowq_max_workers() doesn't limit it, it's a bug on our side.

issue 2) related to issue 3: the hang happens after the invalid timeout occurs dmesg: outputs nothing stack trace: (taken 3 times, all the same) sh(1006)---uring(4870)---{uring}(4872)

I hope your kernel has debug symbols. Can you do the following, please?

$ gdb /lib/modules/$(uname -r)/build/vmlinux
(gdb) l *(io_uring_cancel_generic+0x22a)

If you've built the kernel from sources you can do gdb <kernel-src>/vmlinux instead.

@isilence
Collaborator

isilence commented Oct 19, 2021

reproduced issue 3), IOW we'll fix it soon

edit: not that soon after all; what I reproduced is not quite it, it still needs debugging

@isilence
Collaborator

b) when a timeout is set in io_uring_wait_cqe_timeout() or io_uring_wait_cqes(), these calls cannot be interrupted by a signal (SIGINT or SIGTERM). This was first seen with liburing 0.7.

Do you handle those signals in the app? Do you pass a sigmask, e.g. see the arguments of io_uring_wait_cqes()?
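For reference, io_uring_wait_cqes() takes both a timeout and an optional sigmask; a minimal sketch, with the one-second timeout and the empty mask purely illustrative:

#include <liburing.h>
#include <signal.h>

/* sketch only: wait for one CQE with a timeout while letting signals through;
   returns 0, -ETIME on timeout, or -EINTR when interrupted by a signal */
static int wait_one_cqe(struct io_uring *ring, struct io_uring_cqe **cqe)
{
    struct __kernel_timespec ts = { .tv_sec = 1, .tv_nsec = 0 };
    sigset_t mask;

    sigemptyset(&mask);      /* empty set: block nothing while waiting */
    return io_uring_wait_cqes(ring, cqe, 1, &ts, &mask);
}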

@beldzhang
Author

issue 1)
in my code, io_..._max_workers was set to (1,4); I double-checked that the values were applied, yet the io-wq workers were still over-created.

BTW: after the program exits successfully, the system load increases slowly; checked with htop but I cannot find which task causes it.

issue 2)

(gdb) l *(io_uring_cancel_generic+0x22a)
0xffffffff812cc9ba is in io_uring_cancel_generic (fs/io_uring.c:9793).
9788			 * If we've seen completions, retry without waiting. This
9789			 * avoids a race where a completion comes in before we did
9790			 * prepare_to_wait().
9791			 */
9792			if (inflight == tctx_inflight(tctx, !cancel_all))
9793				schedule();
9794			finish_wait(&tctx->wait, &wait);
9795		} while (1);
9796		atomic_dec(&tctx->in_idle);
9797

issue .b)
I set a signal handler in my program; these two calls can be interrupted normally when the timeout is set to NULL.
It was first found with io_uring_wait_cqe_timeout(); I then tried io_uring_wait_cqes() with a sigmask, with no luck.

I will try to write a testing server/client in C to reproduce these issues.

@beldzhang
Author

OK, I made a test case; it reproduces issues 2) and 3).
Just make and run server, then cli_read.
A few seconds later the error shows up:

[nzel1:~/uring/c]# ./server 
STARTING IO_URING....
 ++ client-accepted: 5
        linked-timeout return -22
        linked-timeout return -22
        linked-timeout return -22
        linked-timeout return -22
        linked-timeout return -22
        recv-head: err=0
^C
  1. the server accepts only once, so it needs to be re-run every time.
  2. if IORING_SETUP_SQPOLL is removed from io_uring_queue_init(), the server exits on CTRL+C; after that the system load slowly increases to 2.00.

issue-460-v1.zip

@isilence
Collaborator

@beldzhang, awesome. I think I have a repro for issue 1 (too many workers), I'll get back later.

@isilence
Collaborator

isilence commented Oct 19, 2021

@beldzhang, found one problem relevant to issue 3, but it's not necessarily the one you have.

  1. Do you call io_uring_register_iowq_max_workers() before issuing any request?
  2. Do you call it from the thread/process that created the ring?
  3. Do you call it from the thread/process that submits requests?

The thing is that io_uring_register_iowq_max_workers() is per process, with some additional rules; we need to improve it a bit.
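A minimal sketch of the ordering this implies, reusing the (1, 4) limits from earlier in the thread (queue depth and limits are illustrative): the register call runs in the task that created the ring and will submit, before any SQE is queued.

#include <liburing.h>

/* sketch only: values[0] is the bounded limit, values[1] the unbounded one;
   passing zeros instead would just read back the current limits */
static int setup_ring(struct io_uring *ring)
{
    unsigned int values[2] = { 1, 4 };
    int ret = io_uring_queue_init(512, ring, 0);

    if (ret < 0)
        return ret;
    return io_uring_register_iowq_max_workers(ring, values);
}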

Also, can you tell your full name and email, so we can attach a reported-by tag to commits to give you some credit?
e.g. Reported-by: Pavel Begunkov <my_email@example.com>

@isilence
Collaborator

@beldzhang, can you try a custom liburing? Want to get some more info.
https://github.com/isilence/liburing/tree/test-515

cd <some_folder>
git clone https://github.com/isilence/liburing.git
cd liburing
git checkout origin/test-515
make

Then you'd need to link against it somehow: make install, or better, point your build at the library path.

@beldzhang
Author

1. Do you call io_uring_register_iowq_max_workers() before issuing any request?
2. Do you call it from the thread/process that created the ring?
3. Do you call it from the thread/process that submits requests?

io_..._max_workers is called right after io_uring_queue_init(), in the same thread, before any submits; and the ring is not shared.

for the custom liburing:

"sigmask set" was not seen.
signal handling differs from my previous post: it is interruptible on 5.15, but not on 5.10.
tested with io_uring_wait_cqe_timeout().

the following counts were seen on a freshly booted system, on the first run of the testing server:

when SQPOLL = off
reenter ts 19 : (conn=5)

when SQPOLL = on
reenter ts 74: (conn=2)
reenter ts 1035: (conn=5)
zero submit!


it is my honour to get this credit; my full name on the internet (not my passport :p) is: Beld Zhang <beldzhang@gmail.com>

@isilence
Collaborator

The reproducer was pretty helpful, thanks. Recapping:

issue 1: when SQPOLL is not set, lots of iou-wrk-#### are spawned (>100) once there are >= 2 connections. I think this should be limited by io_uring_register_iowq_max_workers(). It happens when the disk read/write goes through splice; with buffered I/O it is fine.

There are a couple of new patches in the tree; I'd like you to try them.

issue 2: when SQPOLL is set, after the program exits, iou-sqp-#### hangs in state D (disk sleep). The program cannot be killed; only a system reboot clears it.

very likely it's because of issue 3

issue 3: every network recv/send is linked with a timeout SQE; sometimes the SQE linked to the header recv returns -EINVAL in cqe.res. This never occurred on kernel 5.10. For now I just treat it like ECANCELED and it looks OK.

Fixed, patch is sent.

and other minor issues: a) io_uring_submit() sometimes returns 0; is this an error? We just treat it as success and everything looks fine.

Can happen with SQPOLL, and userspace should be able to tolerate it; that's how liburing (not to be confused with the kernel side) handles SQPOLL submissions. E.g. SQPOLL reaps some requests in the middle of submission and io_uring_submit() might return less than the number of SQEs that were in the queue.
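A minimal sketch of how a caller can treat that (the error reporting is illustrative):

#include <liburing.h>
#include <stdio.h>
#include <string.h>

/* sketch only: with SQPOLL, a return of 0 (or fewer entries than were queued)
   from io_uring_submit() is not an error; only a negative return is */
static int submit_tolerant(struct io_uring *ring)
{
    int ret = io_uring_submit(ring);

    if (ret < 0)
        fprintf(stderr, "submit: %s\n", strerror(-ret));
    /* ret may be less than what was queued: the SQPOLL thread already
       picked the rest up on its own */
    return ret;
}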

b) when a timeout is set in io_uring_wait_cqe_timeout() or io_uring_wait_cqes(), these calls cannot be interrupted by a signal (SIGINT or SIGTERM). This was first seen with liburing 0.7.

confirmed not to happen in 5.15

@beldzhang
Author

beldzhang commented Oct 20, 2021

build and tested with this commit:
https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.15&id=46cb76b2f5ff39bf019bf7a072524fc7fe6deb01


issue 1) fixed

issue 2) still there, can be reproduced by server/cli_read
the stack trace is the same

issue 3) fixed

issue a) confirmed good
issue b) confirmed good on 5.15


issue 4) system load increasing slowly
after testing issue 1), load[1] increases by 0.01 per second until it reaches 2.00. Before the test it is fine, almost 0.0#.
checked with htop (show threads, show kernel threads), but cannot tell which task causes it.
no iou- wq threads left (checked with ps -A)
dmesg: no output after system boot.

@isilence
Collaborator

build and tested with this commit:
https://git.kernel.dk/cgit/linux-block/commit/?h=io_uring-5.15&id=46cb76b2f5ff39bf019bf7a072524fc7fe6deb01

Sorry, I'd usually give more instructions and links, but was waiting until Jens takes the patch.

issue 1) fixed

perfect

issue 2) still there, can be reproduced by server/cli_read; the stack trace is the same

Yeah, a rebase broke it. Jens will hopefully fix the branch today, but if not I've created a new one, basically Jens' io_uring-5.15~1 + that patch. Could you try it?
https://github.com/isilence/linux/tree/unprep

git clone https://github.com/isilence/linux.git
cd linux
git checkout origin/unprep
# and all the same further

issue 4) system load increasing slowly: after testing issue 1), load[1] increases by 0.01 per second until it reaches 2.00; before the test it is fine, almost 0.0#. checked with htop (show threads, show kernel threads), but cannot tell which task causes it. no iou- wq threads left (checked with ps -A). dmesg: no output after system boot.

Can you tell if that's still the case after retesting?

@beldzhang
Author

test result of:
https://github.com/isilence/linux/tree/unprep

issue 1) good

issue 2) fixed

issue 3) good

issue 4) fixed

@isilence
Collaborator

Great, all the problems are from 5.15, so we're lucky on that front.
@beldzhang thanks for reporting and helping with it.

Do you mind if we add a Tested-by tag from you? It will also be attached to the commit.

@beldzhang
Author

Thanks for your hard work :)
I'd be really glad to have these tags.

So shall we close this one now, or wait until 5.15 is released?

@isilence
Collaborator

5.15 will be released soon and you already tested fixes, so I think we can just close it.

ammarfaizi2 pushed a commit to ammarfaizi2/linux-block that referenced this issue Oct 20, 2021
Currently, IORING_REGISTER_IOWQ_MAX_WORKERS applies only to the task
that issued it, which is unexpected for users. If one task creates a ring,
limits workers and then passes it to another task, the limit won't be
applied to the other task.

Another pitfall is that a task should either create a ring or submit at
least one request for IORING_REGISTER_IOWQ_MAX_WORKERS to work at all,
further complicating the picture.

Change the API, save the limits and apply to all future users. Note, it
should be done first before giving away the ring or submitting new
requests otherwise the result is not guaranteed.

Fixes: 2e48005 ("io-wq: provide a way to limit max number of workers")
Link: axboe/liburing#460
Reported-by: Beld Zhang <beldzhang@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/51d0bae97180e08ab722c0d5c93e7439cfb6f697.1634683237.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
ammarfaizi2 pushed a commit to ammarfaizi2/linux-block that referenced this issue Oct 20, 2021
io_unprep_linked_timeout() is broken: first, it needs to restore
REQ_F_ARM_LTIMEOUT so the linked timeout is enqueued and disarmed. But
now that we refcount it, linked timeouts may not get executed at all,
leaking a request.

Just kill the unprep optimisation.

Fixes: 906c6ca ("io_uring: optimise io_prep_linked_timeout()")
Reported-by: Beld Zhang <beldzhang@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/51b8e2bfc4bea8ee625cf2ba62b2a350cc9be031.1634719585.git.asml.silence@gmail.com
Link: axboe/liburing#460
Reported-by: Beld Zhang <beldzhang@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
beldzhang reopened this Oct 28, 2021
@beldzhang
Author

issue 1) on 5.15-rc7.
more details: we are making a diskless server for Windows and Linux clients. The clients read and write blocks from the server, ranging from 512 bytes to 256 KB. One server will handle 1000+ clients.
For performance and stability, multiple rings are used; each ring handles a certain number of connections.

issue 1.1) io-wq workers exceed the limit on an SMP system
rc7 is good on my laptop; then I started testing on the server.

this is with sqpoll on:

uring─┬─4*[{iou-sqp-1642}]
      ├─106*[{iou-wrk-1645}]
      ├─2*[{iou-wrk-1648}]
      ├─105*[{iou-wrk-1651}]
      ├─2*[{iou-wrk-1657}]
      └─5*[{uring}]

this is with sqpoll off:

uring─┬─2*[{iou-wrk-5115}]
      ├─2*[{iou-wrk-5118}]
      ├─2*[{iou-wrk-5121}]
      ├─125*[{iou-wrk-5124}]
      └─5*[{uring}]

everything is the same except the hardware:

  • laptop: i7-8850H, 32GB, NIC: 1Gbps
  • server: Xeon E5-2670 *2, 64GB, NIC: 10Gbps

issue 1.2) side effect of io_..._max_workers(0, 0)
happens when sqpoll is off
after the ring is created, I do the following sequence to set and check the worker counts:

io_..._max_workers: check..([0,0]) = ([48, 127581])
                    set to.([1,2]) = ([48, 127581])
                    verify ([0,0]) = ([1, 2])        <-- side effect

while investigating issue 1.1 I added a verify step, and then iou-wrk-#### became over-created again, although the numbers were set.
before the verify step was added:

uring─┬─2*[{iou-wrk-7563}]
      ├─2*[{iou-wrk-7565}]
      ├─2*[{iou-wrk-7567}]
      ├─2*[{iou-wrk-7569}]
      └─5*[{uring}]

after:

uring─┬─63*[{iou-wrk-7595}]
      ├─45*[{iou-wrk-7597}]
      ├─50*[{iou-wrk-7599}]
      ├─21*[{iou-wrk-7601}]
      └─5*[{uring}]

issue 1.3) is the workers setting process-wide or ring-wide?
I thought the io-wq workers setting was per ring, but got the following (found when sqpoll is off):

io_..._max_workers: check..([0,0]) = ([48, 127581]) <-- first ring
                    set to.([1,2]) = ([48, 127581])
                    verify ([0,0]) = ([1, 2])
io_..._max_workers: check..([0,0]) = ([1, 2]) <-------- second ring
                    set to.([1,2]) = ([1, 2])
                    verify ([0,0]) = ([1, 2])
io_..._max_workers: check..([0,0]) = ([1, 2]) <-------- the third
                    set to.([1,2]) = ([1, 2])
                    verify ([0,0]) = ([1, 2])

after the first ring is created, subsequent rings get the first ring's max-workers setting as their initial value.
after stopping and re-running the program, the same thing repeats.
looks like the workers setting is process-wide.


dmesg is clean, and there are no errors in the program.

@isilence
Collaborator

issue 1) on 5.15-rc7. more details: we are making a diskless server for Windows and Linux clients. The clients read and write blocks from the server, ranging from 512 bytes to 256 KB. One server will handle 1000+ clients. For performance and stability, multiple rings are used; each ring handles a certain number of connections.

issue 1.1) io-wq workers exceed the limit on an SMP system. rc7 is good on my laptop; then I started testing on the server.

106 io-wq workers, right? I see a problem with SQPOLL in the code.

issue 1.2) side effect of io_..._max_workers(0, 0): happens when sqpoll is off; after the ring is created, I do the following sequence to set and check the worker counts:

io_..._max_workers: check..([0,0]) = ([48, 127581])
                    set to.([1,2]) = ([48, 127581])
                    verify ([0,0]) = ([1, 2])        <-- side effect

while investigating issue 1.1 I added a verify step, and then iou-wrk-#### became over-created again, although the numbers were set. before the verify step was added:

Will look into what that is.

issue 1.3) is the workers setting process-wide or ring-wide? I thought the io-wq workers setting was per ring, but got the following (found when sqpoll is off):

It's per task (aka thread or process); I don't like the API, but what you see is expected.

@beldzhang
Author

issue 1.1)
the output was dumped with pstree -t $(pidof uring)
so yes, 106 is the io-wq worker thread count
on my laptop they are all 2*[{iou-wrk-####}], but on the server they exceed the limit

issue 1.3)
there is one ring per thread, all threads in one process.
I'm a little confused then about whether it is per thread or per process.


my program is written in Free Pascal and I'm not familiar with C; I will try to make a more complete test case.

@isilence
Collaborator

issue 1.3) there is one ring per thread, all threads in one process. I'm a little confused then about whether it is per thread or per process.

For the sake of this discussion, all three name the same thing. Initially IORING_REGISTER_IOWQ_MAX_WORKERS set the limit only for the current task (the thread or process that called io_uring_register), with io_uring acting as a mediator, though there were exceptions. Then, after your first issue report, we added some other rules, and it's not easy to put into words what it is now.

The problem is that there is an N:M relation between tasks and io_urings, and since io_uring acts as a mediator, the API will always be a bit confusing whatever we do.

my program is written in Free Pascal and I'm not familiar with C; I will try to make a more complete test case.

I see what may be leading to 1.1 and 1.2, so no worries.

@isilence
Collaborator

@beldzhang, can you try out a branch?

https://github.com/isilence/linux/tree/iowq-limit_test

git clone https://github.com/isilence/linux.git
cd linux
git checkout origin/iowq-limit_test
...

@beldzhang
Author

result of: https://github.com/isilence/linux/tree/iowq-limit_test

issue 1.1)
it gets worse on the server hardware
4 rings, connections = [8,8,8,6], total 30

SQPOLL = on

[TESTING-x86_64:~]# pstree -t $(pidof uring)
uring─┬─4*[{iou-sqp-3212}]
      ├─86*[{iou-wrk-3215}]
      ├─57*[{iou-wrk-3218}]
      ├─56*[{iou-wrk-3221}]
      ├─34*[{iou-wrk-3227}]
      └─5*[{uring}]

SQPOLL = off

[TESTING-x86_64:~]# pstree -t $(pidof uring)
uring─┬─75*[{iou-wrk-6853}]
      ├─84*[{iou-wrk-6855}]
      ├─79*[{iou-wrk-6859}]
      ├─33*[{iou-wrk-6862}]
      └─5*[{uring}]

@isilence
Collaborator

There was a stupid mistake. I force-pushed the branch; can you try it again to confirm whether it fixes 1.2?

I don't know what exactly happens with 1.1 and can't reproduce it. Do I get it right that you see the problem in both SQPOLL and non-SQPOLL modes? It would also be interesting to know whether it's related to 1.2 and the branch fixes it.

@beldzhang
Author

beldzhang commented Oct 29, 2021

result of: isilence/linux@72c7f89

issue 1.2) fixed
tested several times, it does not happen again.

issue 1.1) confirmed, still there
this applies to both SQPOLL and non-SQPOLL.
besides the hardware difference, the kernel config is also different, mainly:

  • laptop PREEMPT=y
  • server PREEMPT_NONE=y

since the server is truly SMP (2 sockets), is anything related to that, or to NUMA?
I could add some kernel logging somewhere to check; just let me know.

Is it related to 1.3? Since the worker limits are not simply tied to the ring,
could my way of setting them cause something unexpected?

@beldzhang
Author

Added some debug printk()s and found that max_workers gets bumped to RLIMIT_NPROC.

change on 5.15-rc7:

--- ../work/00-tmpfs/io-wq.c	2021-10-29 17:12:29.020248700 -0400
+++ fs/io-wq.c	2021-10-29 17:06:50.741514454 -0400
@@ -261,6 +261,9 @@
 	raw_spin_unlock(&wqe->lock);
 	atomic_inc(&acct->nr_running);
 	atomic_inc(&wqe->wq->worker_refs);
+
+	printk("uring-debug: io_wqe_create_worker(): %d vs %d\n", acct->nr_workers, acct->max_workers);
+
 	return create_io_worker(wqe->wq, wqe, acct->index);
 }
 
@@ -290,6 +293,7 @@
 	}
 	raw_spin_unlock(&wqe->lock);
 	if (do_create) {
+		printk("uring-debug: create_worker_cb(): %d vs %d\n", acct->nr_workers, acct->max_workers);
 		create_io_worker(wq, wqe, worker->create_index);
 	} else {
 		atomic_dec(&acct->nr_running);

dmesg (partial):

[  123.189722] uring-debug: io_wqe_create_worker(): 2 vs 2
[  123.190120] uring-debug: io_wqe_create_worker(): 2 vs 2
[  123.190199] uring-debug: io_wqe_create_worker(): 2 vs 2
[  125.565649] uring-debug: io_wqe_create_worker(): 1 vs 257437
[  125.612244] uring-debug: io_wqe_create_worker(): 1 vs 257437
[  125.612533] uring-debug: io_wqe_create_worker(): 2 vs 257437
[  125.628494] uring-debug: io_wqe_create_worker(): 3 vs 257437
[  125.930844] uring-debug: create_worker_cb(): 4 vs 257437
[  125.964480] uring-debug: create_worker_cb(): 5 vs 257437
[  125.967301] uring-debug: create_worker_cb(): 6 vs 257437

full dmesg output attached:
dmesg-max_workers-bumped.txt

@isilence
Collaborator

Can you make sure you call io_uring_register_iowq_max_workers(1,4) (or whatever the arguments were) at least once for each ring? The limit is per task, but how it's propagated is a different story. I'm just guessing how your app is written and want to check a hypothesis.

@beldzhang
Author

In my code, io_uring_register_iowq_max_workers(1,2) is called just after io_uring_queue_init_params(); they are wrapped into a single class method.

After reviewing the epoll to io_uring conversion, I think I was misunderstanding "ring sharing":
I was creating all the rings (calling the above method) in a separate manager thread and then passing each one to a worker thread, so the ring is created in one task and the queue loop runs in another.

Then I moved the ring creation code to the beginning of the worker thread and tested with isilence/linux@72c7f89:

non-SQPOLL: it's good now

SQPOLL: iou-wrk-### is still over-created, but if I change to the following sequence, it becomes good ...

io_uring_queue_init_params(...)
if io_uring_register_iowq_max_workers(0,0) = 0 then begin
    io_uring_register_iowq_max_workers(1,2);
    io_uring_register_iowq_max_workers(0,0);  // iou-wrk-### created as desired after this line is added
end
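The same sequence in C for clarity, assuming a ring already set up with io_uring_queue_init_params() (values illustrative; an all-zero array just reads back the current limits):

#include <liburing.h>

/* sketch only: the same query/set/query sequence in C */
static void limit_workers(struct io_uring *ring)
{
    unsigned int cur[2] = { 0, 0 };
    unsigned int want[2] = { 1, 2 };
    unsigned int check[2] = { 0, 0 };

    io_uring_register_iowq_max_workers(ring, cur);    /* read back, e.g. (48, 127581) */
    io_uring_register_iowq_max_workers(ring, want);   /* set the limits to (1, 2) */
    io_uring_register_iowq_max_workers(ring, check);  /* read back again; workers behave after this call */
}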

BTW, my habit is to set resource usage to the minimum; (1,2) is the smallest value for the ring to work.

@beldzhang
Author

beldzhang commented Oct 30, 2021

@isilence emmm... any idea?

found a problem in io-wq.c:io_wq_max_workers(); made a patch, and issue 1.1 is fixed.

before:

[  127.307054] uring-debug: io_wq_max_workers: acct=00000000b5415868, node=0, idx=0, new=4, prev=64
[  127.307060] uring-debug: io_wq_max_workers: acct=00000000c59638b2, node=0, idx=1, new=8, prev=257451
[  127.307061] uring-debug: io_wq_max_workers: acct=00000000a51d5e92, node=1, idx=0, new=64, prev=257451
[  127.307063] uring-debug: io_wq_max_workers: acct=000000005a29a9ed, node=1, idx=1, new=257451, prev=257451

after fix:

[   43.310441] uring-debug: io_wq_max_workers: acct=000000007a3a1926, node=0, idx=0, new=4, prev=64
[   43.310446] uring-debug: io_wq_max_workers: acct=000000005d1d6a6b, node=0, idx=1, new=8, prev=257451
[   43.310448] uring-debug: io_wq_max_workers: acct=00000000ad946a86, node=1, idx=0, new=4, prev=64
[   43.310450] uring-debug: io_wq_max_workers: acct=00000000c816e00d, node=1, idx=1, new=8, prev=257451

temporary patch I tested:

diff --git a/fs/io-wq.c b/fs/io-wq.c
index 8fcac49b7b13..bddabbe105d3 100644
--- a/fs/io-wq.c
+++ b/fs/io-wq.c
@@ -1306,7 +1306,8 @@ int io_wq_cpu_affinity(struct io_wq *wq, cpumask_var_t mask)
  */
 int io_wq_max_workers(struct io_wq *wq, int *new_count)
 {
-	int i, node, prev = 0;
+	int i, node;//, prev = 0;
+	int prev[IO_WQ_ACCT_NR];
 
 	BUILD_BUG_ON((int) IO_WQ_ACCT_BOUND   != (int) IO_WQ_BOUND);
 	BUILD_BUG_ON((int) IO_WQ_ACCT_UNBOUND != (int) IO_WQ_UNBOUND);
@@ -1317,6 +1318,9 @@ int io_wq_max_workers(struct io_wq *wq, int *new_count)
 			new_count[i] = task_rlimit(current, RLIMIT_NPROC);
 	}
 
+	for (i = 0; i < IO_WQ_ACCT_NR; i++)
+		prev[i] = 0;
+
 	rcu_read_lock();
 	for_each_node(node) {
 		struct io_wqe *wqe = wq->wqes[node];
@@ -1325,13 +1329,18 @@ int io_wq_max_workers(struct io_wq *wq, int *new_count)
 		raw_spin_lock(&wqe->lock);
 		for (i = 0; i < IO_WQ_ACCT_NR; i++) {
 			acct = &wqe->acct[i];
-			prev = max_t(int, acct->max_workers, prev);
+			if (node == 0)
+				prev[i] = max_t(int, acct->max_workers, prev[i]);
 			if (new_count[i])
 				acct->max_workers = new_count[i];
-			new_count[i] = prev;
+			//printk("uring-debug: io_wq_max_workers: acct=%p, node=%d, idx=%d, new=%d, prev=%d\n", acct, node, i, new_count[i], prev[i]);
+			//new_count[i] = prev;
 		}
 		raw_spin_unlock(&wqe->lock);
 	}
+	for (i = 0; i < IO_WQ_ACCT_NR; i++)
+		new_count[i] = prev[i];
+
 	rcu_read_unlock();
 	return 0;
 }

@isilence
Collaborator

isilence commented Nov 2, 2021

Awesome, makes sense; it slipped my mind that there are multiple nodes, I wanted to kill that off but forgot.
It would be great if you sent a patch to io-uring@vger.kernel.org (don't forget to add a Signed-off-by), though we can take over if you don't have time.

@beldzhang
Author

patch sent.

@beldzhang
Author

I think this needs to go to 5.15.x as well; should I send it somewhere else?

@axboe
Owner

axboe commented Nov 3, 2021

It'll automatically go to 5.15.x, that's why I marked it with a fixes tag. You don't need to do anything else.

@beldzhang
Author

@isilence
isilence/linux@72c7f89 is still required, as it fixes issue 1.2

@axboe
Owner

axboe commented Nov 4, 2021

Pavel, are you sending that one in for 5.16?
