System freeze after suspending two instances of Neovim with io_uring enabled #1113

Closed
pajlada opened this issue Mar 31, 2024 · 13 comments

Comments

@pajlada

pajlada commented Mar 31, 2024

Hi!
When running Arch Linux or Fedora Rawhide and suspending two instances of Neovim, which uses libuv, which uses io_uring, I experience a system freeze. It stops me from typing anything in any shell or spawning a new shell, but I'm able to run some simple commands over ssh (e.g. ssh myserver ls -la).
dmesg doesn't report anything interesting as far as I could tell, other than some of the apps that were running not being responsive.

Mar 31 11:53:31 billy kernel: INFO: task st:29757 blocked for more than 122 seconds.
Mar 31 11:53:31 billy kernel:       Tainted: P           OE      6.8.2-arch2-1 #1
Mar 31 11:53:31 billy kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 31 11:53:31 billy kernel: task:st              state:D stack:0     pid:29757 tgid:29757 ppid:29751  flags:0x00004006
Mar 31 11:53:31 billy kernel: Call Trace:
Mar 31 11:53:31 billy kernel:  <TASK>
Mar 31 11:53:31 billy kernel:  __schedule+0x3e6/0x1520
Mar 31 11:53:31 billy kernel:  schedule+0x32/0xd0
Mar 31 11:53:31 billy kernel:  schedule_timeout+0x151/0x160
Mar 31 11:53:31 billy kernel:  wait_for_completion+0x86/0x170
Mar 31 11:53:31 billy kernel:  __flush_work.isra.0+0x173/0x280
Mar 31 11:53:31 billy kernel:  ? __pfx_wq_barrier_func+0x10/0x10
Mar 31 11:53:31 billy kernel:  n_tty_poll+0x134/0x1e0
Mar 31 11:53:31 billy kernel:  tty_poll+0x57/0xc0
Mar 31 11:53:31 billy kernel:  do_select+0x362/0x880
Mar 31 11:53:31 billy kernel:  ? pollwake+0x50/0xa0
Mar 31 11:53:31 billy kernel:  ? __pfx_pollwake+0x10/0x10
Mar 31 11:53:31 billy kernel:  ? __pfx_pollwake+0x10/0x10
Mar 31 11:53:31 billy kernel:  ? __pfx_pollwake+0x10/0x10
Mar 31 11:53:31 billy kernel:  core_sys_select+0x36b/0x530
Mar 31 11:53:31 billy kernel:  do_pselect.constprop.0+0xe9/0x180
Mar 31 11:53:31 billy kernel:  __x64_sys_pselect6+0x3d/0x70
Mar 31 11:53:31 billy kernel:  do_syscall_64+0x86/0x170
Mar 31 11:53:31 billy kernel:  ? do_syscall_64+0x96/0x170
Mar 31 11:53:31 billy kernel:  ? do_syscall_64+0x96/0x170
Mar 31 11:53:31 billy kernel:  ? exc_page_fault+0x7f/0x180
Mar 31 11:53:31 billy kernel:  entry_SYSCALL_64_after_hwframe+0x6e/0x76
Mar 31 11:53:31 billy kernel: RIP: 0033:0x7541e536b640
Mar 31 11:53:31 billy kernel: RSP: 002b:00007ffec990be70 EFLAGS: 00000202 ORIG_RAX: 000000000000010e
Mar 31 11:53:31 billy kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007541e536b640
Mar 31 11:53:31 billy kernel: RDX: 0000000000000000 RSI: 00007ffec990bf50 RDI: 0000000000000005
Mar 31 11:53:31 billy kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 00007ffec990beb0
Mar 31 11:53:31 billy kernel: R10: 0000000000000000 R11: 0000000000000202 R12: bff0000000000000
Mar 31 11:53:31 billy kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000010
Mar 31 11:53:31 billy kernel:  </TASK>
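
For anyone scanning their own logs for the same symptom, here is a minimal sketch of a parser for the "blocked for more than N seconds" line shown above. The struct and function names are hypothetical, chosen for illustration, and the format string only assumes the line layout seen in this report.

```c
#include <stdio.h>
#include <string.h>

/* Fields extracted from a hung-task report line (names are illustrative). */
struct hung_task {
    char comm[32];  /* task name, e.g. "st" */
    int pid;        /* e.g. 29757 */
    int seconds;    /* e.g. 122 */
};

/*
 * Parse a line such as
 *   "... kernel: INFO: task st:29757 blocked for more than 122 seconds."
 * Returns 1 on success, 0 if the line is not a hung-task report.
 * Limitation of this sketch: task names that themselves contain ':'
 * (some kernel worker names do) are not handled.
 */
static int parse_hung_task(const char *line, struct hung_task *out)
{
    const char *p = strstr(line, "INFO: task ");

    if (!p)
        return 0;
    /* %31[^:] reads the task name up to the colon that precedes the PID. */
    if (sscanf(p, "INFO: task %31[^:]:%d blocked for more than %d seconds",
               out->comm, &out->pid, &out->seconds) != 3)
        return 0;
    return 1;
}
```

Feeding it the first line of the trace above should yield the task name st, PID 29757 and a 122 second stall; the other lines of the trace are rejected.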

The freeze doesn't occur after disabling io_uring, either in libuv using UV_USE_IO_URING=0 or in the kernel with sysctl kernel.io_uring_disabled=1.
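
To check which mode a machine is currently in, a small sketch along these lines could read the sysctl from procfs. The function names are hypothetical, and the meanings of the values 0, 1 and 2 are taken from the kernel's sysctl documentation rather than from this thread.

```c
#include <stdio.h>

/* Map a kernel.io_uring_disabled value to a short description. */
static const char *io_uring_disabled_meaning(int value)
{
    switch (value) {
    case 0:
        return "enabled for all processes";
    case 1:
        return "disabled for unprivileged processes outside io_uring_group";
    case 2:
        return "disabled for all processes";
    default:
        return "unknown value";
    }
}

/*
 * Read the current setting from the given path, normally
 * /proc/sys/kernel/io_uring_disabled.
 * Returns the value, or -1 if the file is missing or unreadable
 * (for example on kernels that predate this sysctl).
 */
static int read_io_uring_disabled(const char *path)
{
    FILE *f = fopen(path, "r");
    int value = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

Calling io_uring_disabled_meaning(read_io_uring_disabled("/proc/sys/kernel/io_uring_disabled")) gives a one-line summary. Note that the UV_USE_IO_URING=0 workaround lives in the environment of the libuv process and is not visible through this file.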

uname -a from the tested systems

  • Linux billy 6.6.23-1-lts #1 SMP PREEMPT_DYNAMIC Wed, 27 Mar 2024 07:47:20 +0000 x86_64 GNU/Linux running Arch Linux
  • Linux yolen 6.8.2-arch2-1 #1 SMP PREEMPT_DYNAMIC Thu, 28 Mar 2024 17:06:35 +0000 x86_64 GNU/Linux running Arch Linux
  • Linux localhost 6.9.0-0.rc1.20240329git317c7bc0ef03.20.fc41.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Mar 29 14:04:53 UTC 2024 x86_64 GNU/Linux running Fedora Rawhide

Reproduction steps

  • Be on Arch Linux (bug reproducible on netcup & hetzner servers) or Fedora Rawhide
  • Install Neovim (pacman -S neovim)
  • Run Neovim
  • Suspend Neovim (by pressing CTRL+Z)
  • Run Neovim again
  • Suspend Neovim (by pressing CTRL+Z)
  • Your system is now most likely unresponsive

The suspension can be done in separate shells, or as different users with the same results.

Video showing off the freeze

nvim-suspend-freeze.mp4

I'm still able to run certain apps on the system, but not open a shell.

If the io-uring@vger.kernel.org email is a better place for this report let me know and I'll report it there instead.
Originally reported in libuv/libuv#4377

@axboe
Owner

axboe commented Apr 1, 2024

Funky! Thanks for the report, I'll take a look.

@axboe
Owner

axboe commented Apr 1, 2024

Was able to reproduce the stall. Wow, libuv does some funky stuff.

@axboe
Owner

axboe commented Apr 1, 2024

Just to be clear, this is obviously an io_uring bug, regardless of what libuv does!

@pajlada
Author

pajlada commented Apr 1, 2024

Awesome! Glad to hear you were able to reproduce it. Thank you!

@axboe
Owner

axboe commented Apr 1, 2024

Unsure if you're able to test kernel patches, but I believe the below should do it. Looks like we get into an inversion between the events workqueue being flushed for console output, and io_uring ring exits for some weird cases. If not, then I'll get it into 6.9-rc3 end of this week and it can bubble back to stable from there.

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 5d4b448fdc50..f6277e029d5f 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -147,6 +147,7 @@ static bool io_uring_try_cancel_requests(struct io_ring_ctx *ctx,
 static void io_queue_sqe(struct io_kiocb *req);
 
 struct kmem_cache *req_cachep;
+static struct workqueue_struct *iou_wq __ro_after_init;
 
 static int __read_mostly sysctl_io_uring_disabled;
 static int __read_mostly sysctl_io_uring_group = -1;
@@ -3161,7 +3162,7 @@ static __cold void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx)
 	 * noise and overhead, there's no discernable change in runtime
 	 * over using system_wq.
 	 */
-	queue_work(system_unbound_wq, &ctx->exit_work);
+	queue_work(iou_wq, &ctx->exit_work);
 }
 
 static int io_uring_release(struct inode *inode, struct file *file)
@@ -4185,6 +4186,8 @@ static int __init io_uring_init(void)
 	io_buf_cachep = KMEM_CACHE(io_buffer,
 					  SLAB_HWCACHE_ALIGN | SLAB_PANIC | SLAB_ACCOUNT);
 
+	iou_wq = alloc_workqueue("iou_exit", WQ_UNBOUND, 64);
+
 #ifdef CONFIG_SYSCTL
 	register_sysctl_init("kernel", kernel_io_uring_disabled_table);
 #endif
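
The intent of the patch, moving ring exit work off a queue that the tty also depends on, can be illustrated with a deliberately simplified userspace model. Everything here is an assumption made for illustration: the toy queue is a single-worker FIFO, the names toy_wq, toy_queue_work and toy_run are hypothetical, and the real kernel workqueue is concurrent and behaves differently in detail. The sketch only shows why sharing a queue creates a dependency between unrelated work items.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_ITEMS 8

/* A work item either completes when run or stays blocked forever. */
struct work {
    const char *name;
    bool blocks;  /* true: models exit work that cannot make progress */
    bool done;
};

/* Toy FIFO queue served by one worker (an assumption, not kernel behaviour). */
struct toy_wq {
    struct work *items[MAX_ITEMS];
    size_t count;
};

/* Append an item; returns false if the toy queue is full. */
static bool toy_queue_work(struct toy_wq *wq, struct work *w)
{
    if (wq->count == MAX_ITEMS)
        return false;
    wq->items[wq->count++] = w;
    return true;
}

/* Run items in order; the worker gets stuck at the first blocked item. */
static void toy_run(struct toy_wq *wq)
{
    for (size_t i = 0; i < wq->count; i++) {
        if (wq->items[i]->blocks)
            return;  /* later items on this queue never start */
        wq->items[i]->done = true;
    }
}
```

With a blocked exit item queued ahead of a tty item on one shared queue, the tty item never completes in this model, so anything waiting on it would wait forever. Putting the exit item on its own queue lets the tty item finish regardless.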

@axboe
Owner

axboe commented Apr 1, 2024

Oh, and if you want a Reported-by: tag in the commit, please do let me know and I'll update it with that. Just need an identity + email for that. Queued up:

https://git.kernel.dk/cgit/linux/commit/?h=io_uring-6.9&id=d1a9cef84784f873b457f2622ca2415c4b4db748

@pajlada
Author

pajlada commented Apr 1, 2024

A Reported-by tag would be appreciated yeah! My identity + email below:
Rasmus Karlsson <rasmus.karlsson@pajlada.com>

I'll see if I can test the patch, or if not, test the rc3 kernel when that's released.

@axboe
Owner

axboe commented Apr 1, 2024

Perfect, commit updated:

https://git.kernel.dk/cgit/linux/commit/?h=io_uring-6.9&id=e5444baa42e545bb929ba56c497e7f3c73634099

@pajlada
Author

pajlada commented Apr 2, 2024

Just applied the patch to my 6.8.2 kernel on a system that previously experienced the issue and I can confirm that it fixes it. Thanks for the quick turnaround!

@redbaron

redbaron commented Apr 2, 2024

Out of curiosity, why did all shells freeze, but ssh didn't?

@ichernev

ichernev commented Apr 2, 2024

> Perfect, commit updated:
>
> https://git.kernel.dk/cgit/linux/commit/?h=io_uring-6.9&id=e5444baa42e545bb929ba56c497e7f3c73634099

I tested this on top of Arch Linux kernel 6.8.2-arch2-2 and it seems to work. You have my permission to add Tested-by: Iskren Chernev <me@iskren.info>

@ichernev

ichernev commented Apr 2, 2024

> Out of curiosity, why did all shells freeze, but ssh didn't?

For me ssh was freezing too. But I also failed to kill the nvim after it froze... maybe there are a few variations of this.

@axboe
Owner

axboe commented Apr 3, 2024

Thanks everyone, the patch will go upstream later this week, and it'll bubble back to -stable after that. Marking this one as closed, as a fix exists.

@axboe axboe closed this as completed Apr 3, 2024
torvalds pushed a commit to torvalds/linux that referenced this issue Apr 6, 2024
Rather than use the system unbound event workqueue, use an io_uring
specific one. This avoids dependencies with the tty, which also uses
the system_unbound_wq, and issues flushes of said workqueue from inside
its poll handling.

Cc: stable@vger.kernel.org
Reported-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Iskren Chernev <me@iskren.info>
Link: axboe/liburing#1113
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kaz205 pushed a commit to Kaz205/linux that referenced this issue Apr 8, 2024
commit 73eaa2b upstream.

Rather than use the system unbound event workqueue, use an io_uring
specific one. This avoids dependencies with the tty, which also uses
the system_unbound_wq, and issues flushes of said workqueue from inside
its poll handling.

Cc: stable@vger.kernel.org
Reported-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Iskren Chernev <me@iskren.info>
Link: axboe/liburing#1113
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
johnny-mnemonic pushed a commit to linux-ia64/linux-stable-rc that referenced this issue Apr 9, 2024
commit 73eaa2b upstream.

Rather than use the system unbound event workqueue, use an io_uring
specific one. This avoids dependencies with the tty, which also uses
the system_unbound_wq, and issues flushes of said workqueue from inside
its poll handling.

Cc: stable@vger.kernel.org
Reported-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Iskren Chernev <me@iskren.info>
Link: axboe/liburing#1113
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
mj22226 pushed a commit to mj22226/linux that referenced this issue Apr 9, 2024
commit 73eaa2b upstream.

Rather than use the system unbound event workqueue, use an io_uring
specific one. This avoids dependencies with the tty, which also uses
the system_unbound_wq, and issues flushes of said workqueue from inside
its poll handling.

Cc: stable@vger.kernel.org
Reported-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Iskren Chernev <me@iskren.info>
Link: axboe/liburing#1113
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Kaz205 pushed a commit to Kaz205/linux that referenced this issue Apr 9, 2024
commit 73eaa2b upstream.

Rather than use the system unbound event workqueue, use an io_uring
specific one. This avoids dependencies with the tty, which also uses
the system_unbound_wq, and issues flushes of said workqueue from inside
its poll handling.

Cc: stable@vger.kernel.org
Reported-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Iskren Chernev <me@iskren.info>
Link: axboe/liburing#1113
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
johnny-mnemonic pushed a commit to linux-ia64/linux-stable-rc that referenced this issue Apr 10, 2024
commit 73eaa2b upstream.

Rather than use the system unbound event workqueue, use an io_uring
specific one. This avoids dependencies with the tty, which also uses
the system_unbound_wq, and issues flushes of said workqueue from inside
its poll handling.

Cc: stable@vger.kernel.org
Reported-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Iskren Chernev <me@iskren.info>
Link: axboe/liburing#1113
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Whissi pushed a commit to Whissi/linux-stable that referenced this issue Apr 10, 2024
commit 73eaa2b upstream.

Rather than use the system unbound event workqueue, use an io_uring
specific one. This avoids dependencies with the tty, which also uses
the system_unbound_wq, and issues flushes of said workqueue from inside
its poll handling.

Cc: stable@vger.kernel.org
Reported-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Iskren Chernev <me@iskren.info>
Link: axboe/liburing#1113
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
woodsts pushed a commit to woodsts/linux-stable that referenced this issue Apr 10, 2024
commit 73eaa2b upstream.

Rather than use the system unbound event workqueue, use an io_uring
specific one. This avoids dependencies with the tty, which also uses
the system_unbound_wq, and issues flushes of said workqueue from inside
its poll handling.

Cc: stable@vger.kernel.org
Reported-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Iskren Chernev <me@iskren.info>
Link: axboe/liburing#1113
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
wanghao75 pushed a commit to openeuler-mirror/kernel that referenced this issue May 8, 2024
stable inclusion
from stable-v6.6.26
commit 6b9d49bcd97bfe2eed9ee69014fd977ed0d6b27d
bugzilla: https://gitee.com/openeuler/kernel/issues/I9MPZ8

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=6b9d49bcd97bfe2eed9ee69014fd977ed0d6b27d

--------------------------------

commit 73eaa2b583493b680c6f426531d6736c39643bfb upstream.

Rather than use the system unbound event workqueue, use an io_uring
specific one. This avoids dependencies with the tty, which also uses
the system_unbound_wq, and issues flushes of said workqueue from inside
its poll handling.

Cc: stable@vger.kernel.org
Reported-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Iskren Chernev <me@iskren.info>
Link: axboe/liburing#1113
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
honjow pushed a commit to 3003n/linux that referenced this issue May 13, 2024
commit 73eaa2b upstream.

Rather than use the system unbound event workqueue, use an io_uring
specific one. This avoids dependencies with the tty, which also uses
the system_unbound_wq, and issues flushes of said workqueue from inside
its poll handling.

Cc: stable@vger.kernel.org
Reported-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Iskren Chernev <me@iskren.info>
Link: axboe/liburing#1113
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
tuxedo-bot pushed a commit to tuxedocomputers/linux that referenced this issue Jun 7, 2024
BugLink: https://bugs.launchpad.net/bugs/2065400

commit 73eaa2b583493b680c6f426531d6736c39643bfb upstream.

Rather than use the system unbound event workqueue, use an io_uring
specific one. This avoids dependencies with the tty, which also uses
the system_unbound_wq, and issues flushes of said workqueue from inside
its poll handling.

Cc: stable@vger.kernel.org
Reported-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Rasmus Karlsson <rasmus.karlsson@pajlada.com>
Tested-by: Iskren Chernev <me@iskren.info>
Link: axboe/liburing#1113
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Manuel Diewald <manuel.diewald@canonical.com>
Signed-off-by: Roxana Nicolescu <roxana.nicolescu@canonical.com>