performance degradation starting from 5.7.16 #215

Hi Jens,
I noticed that there is a serious performance regression when using sockets with io_uring, starting from v5.7.16. Unfortunately, it continues into 5.8 and 5.9; I have not yet checked exactly which point releases are affected. perf top shows exactly where the contention happens, but I can not identify the root cause. I am not sure whether it's directly related to io_uring or some other change caused the regression. 5.7.15 and earlier do not exhibit such contention on the same machine (18 physical cores). I am testing on Ubuntu 20.04 patched with prebuilt kernels from https://kernel.ubuntu.com/~kernel-ppa/mainline/
For 5.7.15 my benchmark achieves 1.6M qps and system CPU is at ~80%.
For 5.7.16 or later it achieves only 1M qps and system CPU is at ~100%.

Comments
It's probably this one: commit 175852503b41b3b74ffd99789240fa21ea1b8584
which I did suspect might cause some perf issues... What's your benchmark? Would love to be able to run it and compare. There are ways we can improve this. |
The benchmark programs are attached here: it's just a ping/pong setup simulating a high-throughput scenario. You will need to install the libunwind library for the server. On your server machine run the attached server; on the client machine run the attached client; you can play with its parameters. |
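A minimal sketch of the kind of ping/pong loop such a benchmark uses (a reconstruction, not the attached programs; socket setup, error handling, and all names are assumed):

```c
/* Hedged sketch of a ping/pong echo loop over io_uring: each side
 * receives a message and immediately sends it back, so throughput is
 * dominated by per-operation submit/complete overhead. */
#include <liburing.h>

static int echo_loop(struct io_uring *ring, int fd)
{
    char buf[512];
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    int n;

    for (;;) {
        /* Wait for the next "ping" from the peer. */
        sqe = io_uring_get_sqe(ring);
        io_uring_prep_recv(sqe, fd, buf, sizeof(buf), 0);
        io_uring_submit(ring);
        if (io_uring_wait_cqe(ring, &cqe) < 0)
            return -1;
        n = cqe->res;
        io_uring_cqe_seen(ring, cqe);
        if (n <= 0)
            return n;

        /* Echo the "pong" straight back. */
        sqe = io_uring_get_sqe(ring);
        io_uring_prep_send(sqe, fd, buf, n, 0);
        io_uring_submit(ring);
        if (io_uring_wait_cqe(ring, &cqe) < 0)
            return -1;
        io_uring_cqe_seen(ring, cqe);
    }
}
```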
I can confirm that 175852503b41b3b74ffd99789240fa21ea1b8584 causes the regression. |
I ran your test case this morning, and I agree it does cause a slowdown. It's somewhat annoying, as 99.9% of use cases don't need this, and the reasons are pretty complicated. Basically you only need this for some eventfd related setups, where the same application that does the IO also ends up blocking in the kernel waiting on the eventfd. The eventfd will trigger when completions are run, but some completions are not run until the application exits that loop and processes completions. One idea would be to pass in a flag for io_uring setup that says "I do weird stuff with eventfd, use the safe/slower path" and default to NOT doing this. Don't necessarily like that approach, but right now I can't think of a better way to handle this. |
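If read correctly, the "weird" setup being described is roughly the following (a minimal sketch, assuming the liburing eventfd registration API; ring setup elided):

```c
/* Sketch of the problematic eventfd pattern: the same task that must run
 * the deferred completions blocks in the kernel reading the eventfd.
 * Unless pending task_work can interrupt that sleep, the completions are
 * never posted and the read never returns. */
#include <liburing.h>
#include <sys/eventfd.h>
#include <stdint.h>
#include <unistd.h>

void block_on_eventfd(struct io_uring *ring)
{
    uint64_t val;
    int efd = eventfd(0, 0);

    io_uring_register_eventfd(ring, efd);   /* posted CQEs signal efd */
    /* ... submit requests whose completions are deferred via task_work ... */
    read(efd, &val, sizeof(val));            /* this task sleeps in the kernel */
    close(efd);
}
```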
Can you just document the exact flow as invalid/unsupported? What you are describing sounds like a deadlock.
Btw, I use eventfd polling to wake the io_uring loop from other threads and it works very well, so I guess this applies to a very specific use case of eventfd. Or maybe you mean eventfds triggered by io_uring completions?
|
Note that this is a problem mainly for threaded cases, as threads share the signal group. To process the completion work, we need to lock the signal structure, which obviously turns to a mess if you have a lot of threads in the same group. |
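For concreteness, the contended path being described is, in essence, the TWA_SIGNAL notification that the patch further down modifies (a simplified sketch of the kernel-side code, not a standalone program):

```c
/* Simplified from the TWA_SIGNAL path in kernel/task_work.c: every
 * signal-based task_work notification takes the sighand lock that is
 * shared by all threads in the signal group, so heavily threaded
 * workloads serialize on it. */
unsigned long flags;

if (lock_task_sighand(task, &flags)) {
    task->jobctl |= JOBCTL_TASK_WORK;   /* mark task_work pending */
    signal_wake_up(task, 0);            /* kick the task to run it */
    unlock_task_sighand(task, &flags);
}
```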
What's this signal group? Can I allocate a signal group per thread?
|
Can we detect it and set the flag in-kernel? |
You can't currently break that sharing; clone(2) could potentially work, but doesn't support that. |
Here's an example:
Basically we have a main task, M, that issues a poll request. It then spawns two threads, one (T1) which waits for this poll request to complete, and one (T2) which issues a write that triggers this condition. This write spawns a task_work item for M, this work needs to be processed for the completion CQE to be filled in. Task M then waits for T1 to exit. M ends up waiting in futex, and hence doesn't process the task_work required to have T1 receive the completion event. Hence we won't make any progress here unless we can interrupt the futex wait and process the completion. |
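A hedged reconstruction of that flow as a standalone program (the pipe and all names are choices made for illustration; as described above, M never runs the task_work while blocked in pthread_join unless the futex wait can be interrupted):

```c
#include <liburing.h>
#include <poll.h>
#include <pthread.h>
#include <unistd.h>

static struct io_uring ring;
static int pipefd[2];

/* T1: waits for the poll request's completion event. */
static void *t1_wait(void *arg)
{
    struct io_uring_cqe *cqe;

    io_uring_wait_cqe(&ring, &cqe);   /* CQE appears only after M runs its task_work */
    io_uring_cqe_seen(&ring, cqe);
    return NULL;
}

/* T2: performs the write that makes the poll request ready. */
static void *t2_write(void *arg)
{
    write(pipefd[1], "x", 1);         /* queues task_work on M */
    return NULL;
}

int main(void)
{
    struct io_uring_sqe *sqe;
    pthread_t t1, t2;

    pipe(pipefd);
    io_uring_queue_init(8, &ring, 0);

    /* M issues the poll request. */
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_poll_add(sqe, pipefd[0], POLLIN);
    io_uring_submit(&ring);

    pthread_create(&t1, NULL, t1_wait, NULL);
    pthread_create(&t2, NULL, t2_write, NULL);

    /* M blocks in futex here and, without an interruptible notification,
     * never runs the task_work that posts the CQE T1 is waiting for. */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    io_uring_queue_exit(&ring);
    return 0;
}
```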
I'd love for that to be the case; do let me know if you have any good ideas here. Continuing on the above example, we can't detect that task M is in the kernel and waiting. And even if we could, we could have a race if it's just on its way to do exactly that. Would love to improve this case; it's very annoying to have to slow down the general case for some odd corner cases. It would be very important to fix. |
For this exact case, we could make it work by just doing:
but the problem is that this fixes just the one case that could trigger it. Maybe M would be waiting somewhere else. The canonical case is:
in essence, where we then never run the task_work, as we're not exiting the kernel, and we need to process task_work for the completion to be posted. |
I understand the flow, but I am probably missing something about the mechanics of io_uring behind the scenes.
As a general note, I find the io_uring data structures not very thread-safe from a userland perspective: you can not issue submissions from one thread and reap completions from another without extra care. |
There are only magical threads if we need them; in general, any sort of deferred try/retry is done by something called task_work (let's shorten that to tw). tw is run by the task itself, which makes it a lot less expensive than punting to a different thread. The work is run on transition between kernel and user mode.

So for your questions 1+2: M has to process the poll request completion in the example above. Once T2 triggers the poll request, tw gets added to M. Once M runs this tw item, the completion shows up in the ring, and T1 gets happy and receives the completion event. That's why we need M to do something. For this poll example, we could potentially have T1 process the tw, but that's not necessarily the case for other work items, say if the work item needs to access the mm of M. This would work for T1 since it's a thread, but you could also have the above be M, P1, and P2, where the Ps are forked processes.

And yes, the rings are not thread safe; generally a thread should have its own ring(s). You can, however, get by with isolating the completion and submission sides, as those are independent in the raw API - not in liburing, though. |
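A short sketch of the "one ring per thread" guidance (names assumed; each thread both submits and reaps on a ring it owns, so no cross-thread synchronization of the SQ/CQ is needed):

```c
#include <liburing.h>
#include <pthread.h>

/* Each worker owns a private ring and is the only thread that touches it. */
static void *worker(void *arg)
{
    struct io_uring ring;

    if (io_uring_queue_init(64, &ring, 0) < 0)
        return NULL;
    /* ... submit and reap only from this thread ... */
    io_uring_queue_exit(&ring);
    return NULL;
}

int main(void)
{
    pthread_t threads[4];

    for (int i = 0; i < 4; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(threads[i], NULL);
    return 0;
}
```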
Thank you for the explanation; I think it would be useful to add it to io_uring.pdf or maybe to the man pages.
I understand, based on your explanation, that in a regular case (without
The reason I am asking is that I designed my io_uring loop under slightly different assumptions - see below.
|
Something like this might be viable. If you have a chance to test this, that'd be great:
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7101ac64bb20..79057cecf121 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -18,6 +18,7 @@ config X86_32
select MODULES_USE_ELF_REL
select OLD_SIGACTION
select GENERIC_VDSO_32
+ select ARCH_HAS_TIF_TW
config X86_64
def_bool y
@@ -30,6 +31,7 @@ config X86_64
select MODULES_USE_ELF_RELA
select NEED_DMA_MAP_STATE
select SWIOTLB
+ select ARCH_HAS_TIF_TW
config FORCE_DYNAMIC_FTRACE
def_bool y
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 267701ae3d86..79fe7db3208c 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -93,6 +93,7 @@ struct thread_info {
#define TIF_NOTSC 16 /* TSC is not accessible in userland */
#define TIF_IA32 17 /* IA32 compatibility process */
#define TIF_SLD 18 /* Restore split lock detection on context switch */
+#define TIF_TASKWORK 19 /* task_work pending */
#define TIF_MEMDIE 20 /* is terminating due to OOM killer */
#define TIF_POLLING_NRFLAG 21 /* idle is polling for TIF_NEED_RESCHED */
#define TIF_IO_BITMAP 22 /* uses I/O bitmap */
@@ -123,6 +124,7 @@ struct thread_info {
#define _TIF_NOTSC (1 << TIF_NOTSC)
#define _TIF_IA32 (1 << TIF_IA32)
#define _TIF_SLD (1 << TIF_SLD)
+#define _TIF_TASKWORK (1 << TIF_TASKWORK)
#define _TIF_POLLING_NRFLAG (1 << TIF_POLLING_NRFLAG)
#define _TIF_IO_BITMAP (1 << TIF_IO_BITMAP)
#define _TIF_FORCED_TF (1 << TIF_FORCED_TF)
diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h
index 159c7476b11b..a8e3dbdfad63 100644
--- a/include/linux/entry-common.h
+++ b/include/linux/entry-common.h
@@ -69,7 +69,7 @@
#define EXIT_TO_USER_MODE_WORK \
(_TIF_SIGPENDING | _TIF_NOTIFY_RESUME | _TIF_UPROBE | \
- _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | \
+ _TIF_NEED_RESCHED | _TIF_PATCH_PENDING | _TIF_TASKWORK | \
ARCH_EXIT_TO_USER_MODE_WORK)
/**
diff --git a/include/linux/sched/signal.h b/include/linux/sched/signal.h
index 1bad18a1d8ba..470a1da5166b 100644
--- a/include/linux/sched/signal.h
+++ b/include/linux/sched/signal.h
@@ -355,7 +355,8 @@ static inline int restart_syscall(void)
static inline int signal_pending(struct task_struct *p)
{
- return unlikely(test_tsk_thread_flag(p,TIF_SIGPENDING));
+ return unlikely(test_tsk_thread_flag(p, TIF_SIGPENDING) ||
+ test_tsk_thread_flag(p, TIF_TASKWORK));
}
static inline int __fatal_signal_pending(struct task_struct *p)
@@ -502,7 +503,8 @@ extern int set_user_sigmask(const sigset_t __user *umask, size_t sigsetsize);
static inline void restore_saved_sigmask_unless(bool interrupted)
{
if (interrupted)
- WARN_ON(!test_thread_flag(TIF_SIGPENDING));
+ WARN_ON(!test_thread_flag(TIF_SIGPENDING) &&
+ !test_thread_flag(TIF_TASKWORK));
else
restore_saved_sigmask();
}
diff --git a/include/linux/signal.h b/include/linux/signal.h
index 7bbc0e9cf084..4bf470d66791 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -6,6 +6,11 @@
#include <linux/signal_types.h>
#include <linux/string.h>
+#ifndef CONFIG_ARCH_HAS_TIF_TW
+#define TIF_TASKWORK 0
+#define _TIF_TASKWORK 0
+#endif
+
struct task_struct;
/* for sysctl */
diff --git a/init/Kconfig b/init/Kconfig
index d6a0b31b13dc..3f9c9dd69f84 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -2333,6 +2333,9 @@ config ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
config ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
bool
+config ARCH_HAS_TIF_TW
+ bool
+
# It may be useful for an architecture to override the definitions of the
# SYSCALL_DEFINE() and __SYSCALL_DEFINEx() macros in <linux/syscalls.h>
# and the COMPAT_ variants in <linux/compat.h>, in particular to use a
diff --git a/kernel/entry/common.c b/kernel/entry/common.c
index 6fdb6105e6d6..755e0aac7e2c 100644
--- a/kernel/entry/common.c
+++ b/kernel/entry/common.c
@@ -160,6 +160,9 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
if (ti_work & _TIF_SIGPENDING)
arch_do_signal(regs);
+ if (ti_work & _TIF_TASKWORK)
+ task_work_run();
+
if (ti_work & _TIF_NOTIFY_RESUME) {
clear_thread_flag(TIF_NOTIFY_RESUME);
tracehook_notify_resume(regs);
diff --git a/kernel/signal.c b/kernel/signal.c
index a38b3edc6851..3ed090798811 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -2529,6 +2529,9 @@ bool get_signal(struct ksignal *ksig)
struct signal_struct *signal = current->signal;
int signr;
+ if (unlikely(current->task_works))
+ task_work_run();
+
if (unlikely(uprobe_deny_signal()))
return false;
diff --git a/kernel/task_work.c b/kernel/task_work.c
index 613b2d634af8..de85a4c3f899 100644
--- a/kernel/task_work.c
+++ b/kernel/task_work.c
@@ -28,7 +28,6 @@ int
task_work_add(struct task_struct *task, struct callback_head *work, int notify)
{
struct callback_head *head;
- unsigned long flags;
do {
head = READ_ONCE(task->task_works);
@@ -41,7 +40,10 @@ task_work_add(struct task_struct *task, struct callback_head *work, int notify)
case TWA_RESUME:
set_notify_resume(task);
break;
- case TWA_SIGNAL:
+ case TWA_SIGNAL: {
+#ifndef CONFIG_ARCH_HAS_TIF_TW
+ unsigned long flags;
+
/*
* Only grab the sighand lock if we don't already have some
* task_work pending. This pairs with the smp_store_mb()
@@ -53,7 +55,11 @@ task_work_add(struct task_struct *task, struct callback_head *work, int notify)
signal_wake_up(task, 0);
unlock_task_sighand(task, &flags);
}
+#else
+ set_tsk_thread_flag(task, TIF_TASKWORK);
+#endif
break;
+ }
}
return 0;
@@ -110,6 +116,11 @@ void task_work_run(void)
struct task_struct *task = current;
struct callback_head *work, *head, *next;
+#ifdef CONFIG_ARCH_HAS_TIF_TW
+ if (test_tsk_thread_flag(task, TIF_TASKWORK))
+ clear_tsk_thread_flag(task, TIF_TASKWORK);
+#endif
+
for (;;) {
/*
* work->func() can do task_work_add(), do not set
|
Sure, will test it during the weekend and let you know.
Btw, why is it only for x86? Will you apply similar changes to aarch64? |
Thanks! I'm still fiddling around with this; it's x86-only right now since a) it's arch dependent, and b) that's the only arch I can test on. There's nothing preventing similar changes on other archs. |
Here's a cleaned up version:
|
Committed it here for testing: https://git.kernel.dk/cgit/linux-block/log/?h=tif-task_work and I also added an arm64 patch for you. Also shows how trivial it is to add arch support for this. |
Checking. Meanwhile, please fix the "defien" typo.
|
Ok, checked the latest fix. Great job! P.S. Please backport the fix to 5.8 for the upcoming Ubuntu release. |
Awesome, thanks for testing! I've cleaned it up a bit more, and posted it for comments/review: https://lore.kernel.org/io-uring/0b5336a7-c975-a8f8-e988-e983e2340d99@kernel.dk/T/#u I forgot to add your Reported-by, I'll get that done for the next round. I'm sure we're not done yet... |
np, I've recently learnt that my company would appreciate it if my work email appeared in the log. Can you please use romger@amazon.com from now on? |
You bet, I'll use that email. Once we have a final version and you've vetted it, it would be great if I could also add your Tested-by in there. |
It's queued up for 5.11 and will go into 5.10-stable once that's done. Closing this one out. |
Merge tag 'tif-task_work.arch-2020-12-14' of git://git.kernel.dk/linux-block

Pull TIF_NOTIFY_SIGNAL updates from Jens Axboe:
"This sits on top of the core entry/exit and x86 entry branch from the tip tree, which contains the generic and x86 parts of this work. Here we convert the rest of the archs to support TIF_NOTIFY_SIGNAL.

With that done, we can get rid of JOBCTL_TASK_WORK from task_work and signal.c, and also remove a deadlock work-around in io_uring around knowing that signal based task_work waking is invoked with the sighand wait queue head lock.

The motivation for this work is to decouple signal notify based task_work, of which io_uring is a heavy user, from sighand. The sighand lock becomes a huge contention point, particularly for threaded workloads where it's shared between threads. Even outside of threaded applications it's slower than it needs to be.

Roman Gershman <romger@amazon.com> reported that his networked workload dropped from 1.6M QPS at 80% CPU to 1.0M QPS at 100% CPU after io_uring was changed to use TIF_NOTIFY_SIGNAL. The time was all spent hammering on the sighand lock, showing 57% of the CPU time there [1].

There are further cleanups possible on top of this. One example is TIF_PATCH_PENDING, where a patch already exists to use TIF_NOTIFY_SIGNAL instead. Hopefully this will also lead to more consolidation, but the work stands on its own as well"

[1] axboe/liburing#215

* tag 'tif-task_work.arch-2020-12-14' of git://git.kernel.dk/linux-block: (28 commits)
io_uring: remove 'twa_signal_ok' deadlock work-around
kernel: remove checking for TIF_NOTIFY_SIGNAL
signal: kill JOBCTL_TASK_WORK
io_uring: JOBCTL_TASK_WORK is no longer used by task_work
task_work: remove legacy TWA_SIGNAL path
sparc: add support for TIF_NOTIFY_SIGNAL
riscv: add support for TIF_NOTIFY_SIGNAL
nds32: add support for TIF_NOTIFY_SIGNAL
ia64: add support for TIF_NOTIFY_SIGNAL
h8300: add support for TIF_NOTIFY_SIGNAL
c6x: add support for TIF_NOTIFY_SIGNAL
alpha: add support for TIF_NOTIFY_SIGNAL
xtensa: add support for TIF_NOTIFY_SIGNAL
arm: add support for TIF_NOTIFY_SIGNAL
microblaze: add support for TIF_NOTIFY_SIGNAL
hexagon: add support for TIF_NOTIFY_SIGNAL
csky: add support for TIF_NOTIFY_SIGNAL
openrisc: add support for TIF_NOTIFY_SIGNAL
sh: add support for TIF_NOTIFY_SIGNAL
um: add support for TIF_NOTIFY_SIGNAL
...