
Worker threads idling/stalling when using IOSQE_ASYNC for network I/O #426

Open
johalun opened this issue Sep 7, 2021 · 21 comments
johalun commented Sep 7, 2021

In my application I'm doing a lot of mixed disk and network I/O. Without using SQPOLL or IOSQE_ASYNC, the performance is pretty good, but profiling shows that the main thread spends a lot of time in io_write/io_read for the network I/O (calling into the TCP stack).

I'd like to improve performance by offloading the network I/O to worker threads as well, using IOSQE_ASYNC. It works in the sense that io_read and io_write are indeed moved to a worker thread, but the performance is terrible. Even under full load, which would normally cause the main thread and worker threads to use significant CPU, the CPU usage is almost nothing at all. I don't know exactly what is going on, but it's as if the worker threads are sleeping and not getting notified that there's work for them to do. The effect is most noticeable on socket reads, compared to for example write, connect or close.
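For illustration, submitting a socket read that gets punted to an io-wq worker looks roughly like this with liburing (a minimal sketch, not the application's actual code; the ring, socket fd and buffer are assumed to already exist):

#include <errno.h>
#include <stddef.h>
#include <liburing.h>

/* Queue a recv on sockfd and force it to an io-wq worker via IOSQE_ASYNC.
 * Sketch only: error handling and buffer lifetime are simplified. */
static int queue_async_recv(struct io_uring *ring, int sockfd, void *buf, size_t len)
{
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

        if (!sqe)
                return -EBUSY;
        io_uring_prep_recv(sqe, sockfd, buf, len, 0);
        io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);  /* punt straight to io-wq */
        io_uring_sqe_set_data(sqe, buf);           /* to identify the CQE later */
        return io_uring_submit(ring);
}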


HowHsu commented Sep 24, 2021

Hi johalun,
Which kernel version are you using? Do the io-workers hang there for socket reads/writes?


johalun commented Sep 24, 2021

This is with 5.14 and the latest patches from #420. Yes, this only happens for socket reads/writes (possibly connect/close too). Disk I/O is already offloaded to worker threads, so there's no need for IOSQE_ASYNC there, and disk I/O performs well.


HowHsu commented Sep 26, 2021

Does the number of worker threads exceed the registered limit now, with or without IOSQE_ASYNC?


HowHsu commented Sep 26, 2021

Btw @axboe, have you started working on the IOSQE_ASYNC_HYBRID stuff? If not, I'd like to give it a shot.


axboe commented Sep 26, 2021

@HowHsu By all means go ahead, I'm knee-deep in a bunch of other things. Thanks!


johalun commented Sep 26, 2021

Does the number of worker threads exceed the registered limit now, with or without IOSQE_ASYNC?

It worked as expected with the latest patches from #420. I'm not sure if they have all been committed yet. @axboe can probably confirm this.


axboe commented Sep 26, 2021

Those fixes are in mainline.


HowHsu commented Sep 29, 2021

Hi @johalun,
Do you mind trying the two patches below to see if things get better?
(I developed them on top of the for-5.16/io_uring branch of https://git.kernel.dk/linux-block)
Since the IOSQE_ASYNC_HYBRID flag hasn't been merged into the kernel or liburing yet, you need to set the bit manually:

sqe->flags |= (1 << 6);
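On the application side that could look roughly like this (a sketch only; the local define simply mirrors the bit added as IOSQE_ASYNC_HYBRID_BIT in the patch below):

#include <liburing.h>

/* Not in liburing yet, so define bit 6 locally (matches the patch below). */
#ifndef IOSQE_ASYNC_HYBRID
#define IOSQE_ASYNC_HYBRID (1U << 6)
#endif

static void set_async_hybrid(struct io_uring_sqe *sqe)
{
        sqe->flags |= IOSQE_ASYNC_HYBRID;   /* same as sqe->flags |= (1 << 6) */
}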
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 8317c360f7a4..9e84d273c4ec 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -104,7 +104,8 @@
 #define IORING_MAX_REG_BUFFERS (1U << 14)

 #define SQE_COMMON_FLAGS (IOSQE_FIXED_FILE | IOSQE_IO_LINK | \
-                         IOSQE_IO_HARDLINK | IOSQE_ASYNC)
+                         IOSQE_IO_HARDLINK | IOSQE_ASYNC | \
+                         IOSQE_ASYNC_HYBRID)

 #define SQE_VALID_FLAGS        (SQE_COMMON_FLAGS|IOSQE_BUFFER_SELECT|IOSQE_IO_DRAIN)

@@ -725,6 +726,7 @@ enum {
        REQ_F_HARDLINK_BIT      = IOSQE_IO_HARDLINK_BIT,
        REQ_F_FORCE_ASYNC_BIT   = IOSQE_ASYNC_BIT,
        REQ_F_BUFFER_SELECT_BIT = IOSQE_BUFFER_SELECT_BIT,
+       REQ_F_ASYNC_HYBRID_BIT  = IOSQE_ASYNC_HYBRID_BIT,

        /* first byte is taken by user flags, shift it to not overlap */
        REQ_F_FAIL_BIT          = 8,
@@ -762,6 +764,8 @@ enum {
        REQ_F_FORCE_ASYNC       = BIT(REQ_F_FORCE_ASYNC_BIT),
        /* IOSQE_BUFFER_SELECT */
        REQ_F_BUFFER_SELECT     = BIT(REQ_F_BUFFER_SELECT_BIT),
+       /* IOSQE_ASYNC_HYBRID */
+       REQ_F_ASYNC_HYBRID      = BIT(REQ_F_ASYNC_HYBRID_BIT),

        /* fail rest of links */
        REQ_F_FAIL              = BIT(REQ_F_FAIL_BIT),
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index b270a07b285e..fac037dbcf31 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -70,6 +70,7 @@ enum {
        IOSQE_IO_HARDLINK_BIT,
        IOSQE_ASYNC_BIT,
        IOSQE_BUFFER_SELECT_BIT,
+       IOSQE_ASYNC_HYBRID_BIT,
 };

 /*
@@ -87,7 +88,8 @@ enum {
 #define IOSQE_ASYNC            (1U << IOSQE_ASYNC_BIT)
 /* select buffer from sqe->buf_group */
 #define IOSQE_BUFFER_SELECT    (1U << IOSQE_BUFFER_SELECT_BIT)
-
+/* first force async then arm poll */
+#define IOSQE_ASYNC_HYBRID     (1U << IOSQE_ASYNC_HYBRID_BIT)
 /*
  * io_uring_setup() flags
  */
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 9e84d273c4ec..089583cf73f7 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -5635,7 +5635,13 @@ static int io_arm_poll_handler(struct io_kiocb *req)
        req->apoll = apoll;
        req->flags |= REQ_F_POLLED;
        ipt.pt._qproc = io_async_queue_proc;
-       io_req_set_refcount(req);
+       /*
+        * REQ_F_REFCOUNT being set indicates we are in io-worker context,
+        * where we have already explicitly set the submission and completion
+        * refs. So no need to set the refcount here in that case.
+        */
+       if (!(req->flags & REQ_F_REFCOUNT))
+               io_req_set_refcount(req);

        ret = __io_arm_poll_handler(req, &apoll->poll, &ipt, mask,
                                        io_async_wake);
@@ -6795,8 +6801,11 @@ static void io_wq_submit_work(struct io_wq_work *work)
                ret = -ECANCELED;

        if (!ret) {
+               bool need_poll = req->flags & REQ_F_ASYNC_HYBRID;
+
                do {
-                       ret = io_issue_sqe(req, 0);
+issue_sqe:
+                       ret = io_issue_sqe(req, need_poll ? IO_URING_F_NONBLOCK : 0);
                        /*
                         * We can get EAGAIN for polled IO even though we're
                         * forcing a sync submission from here, since we can't
@@ -6804,6 +6813,30 @@ static void io_wq_submit_work(struct io_wq_work *work)
                         */
                        if (ret != -EAGAIN)
                                break;
+                       if (need_poll) {
+                               bool armed = false;
+
+                               ret = 0;
+                               need_poll = false;
+
+                               switch (io_arm_poll_handler(req)) {
+                               case IO_APOLL_READY:
+                                       goto issue_sqe;
+                               case IO_APOLL_ABORTED:
+                                       /*
+                                        * somehow we failed to arm the poll infra,
+                                        * fall back to a normal async worker attempt.
+                                        */
+                                       break;
+                               case IO_APOLL_OK:
+                                       armed = true;
+                                       break;
+                               }
+
+                               if (armed)
+                                       break;
+
+                       }
                        cond_resched();
                } while (1);
        }
@@ -7003,7 +7036,8 @@ static inline void io_queue_sqe(struct io_kiocb *req)
        if (unlikely(req->ctx->drain_active) && io_drain_req(req))
                return;

-       if (likely(!(req->flags & (REQ_F_FORCE_ASYNC | REQ_F_FAIL)))) {
+       if (likely(!(req->flags & (REQ_F_FORCE_ASYNC | REQ_F_ASYNC_HYBRID |
+                                  REQ_F_FAIL)))) {
                __io_queue_sqe(req);
        } else if (req->flags & REQ_F_FAIL) {
                io_req_complete_fail_submit(req);


johalun commented Sep 29, 2021

@HowHsu Thanks for the fast work! Unfortunately the behavior with hybrid is the same as with regular IOSQE_ASYNC.


johalun commented Sep 30, 2021

OK, I narrowed down the issue a bit more. If I only do reads and writes from/to files and sockets, performance is good with IOSQE_ASYNC_HYBRID enabled for network reads and writes. All reads and writes are offloaded to worker threads and the main thread spends very little time in io_uring_submit. It may also be good with plain IOSQE_ASYNC (not tested yet).

The issue starts when I use sendfile to send cached objects from disk to a client TCP socket. My sendfile is implemented using splice; you can see roughly how I did it here: https://github.com/johalun/uring-sendfile/blob/main/server.c#L85-L161
Somehow, using splice causes the performance to drop significantly. Both CPU usage and throughput drop. I don't set IOSQE_ASYNC{_HYBRID} for the splice commands since those are already offloaded to worker threads.
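For context, the splice-based sendfile in that code works roughly like this (a simplified sketch of the approach, not the code from the repository; pipefd is a pipe created beforehand, and NULL checks on the SQEs are omitted):

#include <stdint.h>
#include <liburing.h>

/* file -> pipe -> socket as two SQEs, linked with IOSQE_IO_LINK so the
 * second splice only runs after the first completes. */
static int queue_sendfile(struct io_uring *ring, int filefd, int64_t off,
                          int sockfd, int pipefd[2], unsigned int len)
{
        struct io_uring_sqe *sqe;

        sqe = io_uring_get_sqe(ring);            /* file -> pipe write end */
        io_uring_prep_splice(sqe, filefd, off, pipefd[1], -1, len, 0);
        sqe->flags |= IOSQE_IO_LINK;

        sqe = io_uring_get_sqe(ring);            /* pipe read end -> socket */
        io_uring_prep_splice(sqe, pipefd[0], -1, sockfd, -1, len, 0);

        return io_uring_submit(ring);
}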


axboe commented Oct 2, 2021

@johalun That seems to be a good clue. IORING_OP_SPLICE is a bit tricky in terms of where it should go, bounded vs unbounded, and I'm guessing it ends up in the unbounded queue and needs to wait on someone else because you've hit your limit of unbounded workers?

Pondering how best to handle that, as you're dealing with both types of files there. For your particular case, the socket write is probably trivial and can generally proceed, which means it should end up in the bounded queue.

Does it make any difference if you mark the splice IOSQE_ASYNC or not?


johalun commented Oct 4, 2021

I'm guessing it ends up in the unbounded queue and needs to wait on someone else because you've hit your limit of unbounded workers?

Yep, that seems to be the case. Increasing unbounded workers improved performance a lot! This is with the latest patch for hybrid async, which is enabled for all I/O except splice. IOSQE_ASYNC, IOSQE_ASYNC_HYBRID or no flag makes no difference for splice.

How does the sendfile(2) implementation handle this? Does it spawn a new thread for each splice call? It seems that, at least with uring right now, we need one unbound worker per concurrent splice if we don't want them waiting on each other.


johalun commented Oct 4, 2021

On my server it seems 512 unbound workers is enough to reach max throughput while serving files from disk over HTTP using IORING_OP_SPLICE. I find it odd, though, that I can reach almost the same performance just doing reads/writes with uring and only a few worker threads. Is this just how splice is implemented, or can it be called asynchronously (within the kernel) without the worker thread having to block?
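For reference, the io-wq worker limits can be adjusted from the application with liburing's io_uring_register_iowq_max_workers() (kernel 5.15+). A minimal sketch; the 512 here is just the number that worked in this test, not a recommendation:

#include <stdio.h>
#include <string.h>
#include <liburing.h>

/* values[0] is the bounded (e.g. disk) limit, values[1] the unbounded limit;
 * a zero entry leaves that limit unchanged. */
static int raise_unbounded_workers(struct io_uring *ring)
{
        unsigned int values[2] = { 0, 512 };
        int ret = io_uring_register_iowq_max_workers(ring, values);

        if (ret < 0)
                fprintf(stderr, "iowq_max_workers: %s\n", strerror(-ret));
        return ret;
}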


johalun commented Oct 8, 2021

I'm seeing stuck CPUs with ASYNC_HYBRID

[255000.961288] watchdog: BUG: soft lockup - CPU#3 stuck for 101s! [iou-wrk-15349:16380]
[255000.997287] watchdog: BUG: soft lockup - CPU#25 stuck for 101s! [iou-wrk-15349:16370]
[255000.997290] Modules linked in: binfmt_misc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp i
[255000.997342] CPU: 25 PID: 16370 Comm: iou-wrk-15349 Tainted: G      D      L    5.15.0-rc2+ #1
[255000.997345] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.12.1 12/07/2020
[255000.997346] RIP: 0010:native_queued_spin_lock_slowpath+0x195/0x1e0
[255000.997352] Code: c1 ee 12 83 e0 03 83 ee 01 48 c1 e0 05 48 63 f6 48 05 c0 9c 02 00 48 03 04 f5 80 da 09 a3 48 89 10 8b 427
[255000.997354] RSP: 0018:ffffab80c6b37b98 EFLAGS: 00000246
[255000.997355] RAX: 0000000000000000 RBX: ffff98ebd596a000 RCX: 0000000000680000
[255000.997357] RDX: ffff992a3fc69cc0 RSI: 0000000000000001 RDI: ffff98ebd596a480
[255000.997358] RBP: ffffab80c6b37b98 R08: 0000000000680000 R09: ffffab80c6b37a68
[255000.997360] R10: ffff992a3fc659c8 R11: 0000000000008000 R12: ffff9901e7ccc300
[255000.997361] R13: ffff98ebd596a480 R14: 0000000000008000 R15: 0000000000000000
[255000.997363] FS:  000014a219a25b80(0000) GS:ffff992a3fc40000(0000) knlGS:0000000000000000
[255000.997365] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[255000.997366] CR2: 0000564759ca5000 CR3: 0000001c9b7d6005 CR4: 00000000003706e0
[255000.997368] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[255000.997369] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[255000.997371] Call Trace:
[255000.997372]  _raw_spin_lock+0x1f/0x30
[255000.997375]  io_req_complete_post+0x32/0x350
[255000.997378]  kiocb_done+0x270/0x2a0
[255000.997381]  io_write+0x22a/0x360
[255000.997385]  ? update_cfs_group+0x9f/0xc0
[255000.997387]  ? dequeue_entity+0xe8/0x400
[255000.997389]  ? update_load_avg+0x1a6/0x5e0
[255000.997393]  ? reweight_entity+0x105/0x120
[255000.997396]  io_issue_sqe+0x933/0x1780
[255000.997398]  ? finish_task_switch+0xaa/0x290
[255000.997402]  ? io_issue_sqe+0x933/0x1780
[255000.997404]  ? lock_timer_base+0x6b/0x90
[255000.997408]  io_wq_submit_work+0x5c/0x100
[255000.997411]  io_worker_handle_work+0x22e/0x430
[255000.997414]  io_wqe_worker+0x103/0x370
[255000.997417]  ? finish_task_switch+0xaa/0x290
[255000.997420]  ? recalc_sigpending+0x1b/0x50
[255000.997422]  ? io_worker_handle_work+0x430/0x430
[255000.997425]  ret_from_fork+0x1f/0x30
[255000.997429] RIP: 0033:0x0
[255000.997430] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[255000.997431] RSP: 002b:0000000000000000 EFLAGS: 00000246
[255000.997433] RAX: 0000000000000000 RBX: 000055a37cb94de0 RCX: 0000000000000002
[255000.997434] RDX: 000055a37bce7dc0 RSI: 000000000001a290 RDI: 000000000000149f
[255000.997436] RBP: 00007ffeaf74e1a0 R08: 000000000001a290 R09: 0000000000000008
[255000.997437] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000421
[255000.997438] R13: 000055a39806a730 R14: 000055a38c1bc280 R15: 000055a37c67a7e0
(Modules linked in, continued; the module lists of the stuck CPUs are interleaved in the dmesg output: sb_edac x86_pkg_temp_thermal intel_powerclamp bonding coretemp sch_fq_codel kvm_intel ipmi_ssif kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel ledtrig_audio dell_smbios dell_wmi_descriptor crypto_simd video cryptd rapl dcdbas mei_me intel_cstate lpc_ich mei acpi_power_meter mac_hid ipmi_si ipmi_devintf ipmi_msghandler ip_tables x_tables autofs4 raid10 raid0 multipath linear raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 megaraid_sas mptsas mptscsih mptbase mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core mxm_wmi cxgb4 drm mpt3sas tg3 ahci raid_class libahci tls scsi_transport_sas wmi)

(The dumps for the other two stuck workers, CPU#3 (PID 16380, RIP: native_queued_spin_lock_slowpath+0x193/0x1e0) and CPU#1 (PID 16361, RIP: native_queued_spin_lock_slowpath+0x198/0x1e0), are interleaved with each other in the dmesg output. Both are on the same 5.15.0-rc2+ kernel and PowerEdge R430 and show the same call trace shape as CPU#25 above, one through io_read and one through io_write:)

 _raw_spin_lock+0x1f/0x30
 io_req_complete_post+0x32/0x350
 kiocb_done+0x270/0x2a0
 io_read+0x166/0x520 / io_write+0x22a/0x360
 io_issue_sqe+0x933/0x1780 (or +0xb49/0x1780)
 io_wq_submit_work+0x5c/0x100
 io_worker_handle_work+0x22e/0x430
 io_wqe_worker+0x103/0x370
 ret_from_fork+0x1f/0x30
 RIP: 0033:0x0


johalun commented Oct 8, 2021

Hmm, after a clean rebuild of my app it seems to work. It could have been just a temporary weirdness with the macro having the wrong value or something. I'll run some more tests.
Edit: Nope, on the second run it hung again...


victorstewart commented Oct 8, 2021

So for networked reads, will IOSQE_ASYNC_HYBRID always deliver better performance?

Might it make sense to default to it inside the kernel then, unless the user requests sync or regular async?

Though why would they ever? Maybe it makes sense to disable guaranteed performance-losing choices for certain ops so users can't shoot themselves in the foot?

And for direct I/O reads, is forcing async (IOSQE_ASYNC) likewise always the best choice? The same logic might apply: why give users known-worse options?


HowHsu commented Oct 9, 2021

Hmm, after a clean rebuild of my app it seems to work. It could have been just a temporary weirdness with the macro having the wrong value or something. I'll run some more tests. Edit: Nope, on the second run it hung again...

Hi Johannes, I'm just back from holiday, sorry for the delay. Hmm, could you try the latest version of async hybrid on top of for-5.16/io_uring?
https://lore.kernel.org/io-uring/20211008123642.229338-1-haoxu@linux.alibaba.com/T/#t


HowHsu commented Oct 9, 2021

So for networked reads, will IOSQE_ASYNC_HYBRID always deliver better performance?

IMHO, no. It really depends on how many CPU resources users want to spend. Suppose we are doing socket reads and each one is heavy (reading a lot of data): in ASYNC_HYBRID mode the actual reading is done in the original context, which means we use fewer CPU resources, while in ASYNC mode the reading is done in the io-worker context, which means we use more CPU.
So if CPU resources are adequate, ASYNC should be better than HYBRID.
The CPU cost of non-async, async-hybrid and force-async is: non-async < async-hybrid < force-async.

Might it make sense to default to it inside the kernel then, unless the user requests sync or regular async?

Though why would they ever? Maybe it makes sense to disable guaranteed performance-losing choices for certain ops so users can't shoot themselves in the foot?

And for direct I/O reads, is forcing async (IOSQE_ASYNC) likewise always the best choice? The same logic might apply: why give users known-worse options?

Similar answer for this one: if you have scarce CPU resources, doing it the force-async way under high IO pressure results in frequent context switching. To avoid this you may set max_worker to nr_cpu or 2*nr_cpu, but that causes request delay, since it's easy to reach max_worker; then all io-workers are busy doing heavy work, other work waits, and latency increases.


HowHsu commented Oct 21, 2021

Hmm, after a clean rebuild of my app it seems to work. It could have been just a temporary weirdness with the macro having the wrong value or something. I'll run some more tests. Edit: Nope, on the second run it hung again...

https://lore.kernel.org/io-uring/20211018133445.103438-1-haoxu@linux.alibaba.com/T/#u
Hi Johannes, could you try this one when you have time? It's the newest version of async hybrid; there's no need to set IOSQE_ASYNC_HYBRID, just use IOSQE_ASYNC to enable it. Hopefully no more hangs.


johalun commented Oct 22, 2021

Yep, this seems stable so far. Consistent performance and no hung CPUs 👍🏻


HowHsu commented Nov 29, 2021

I'm guessing it ends up in the unbounded queue and needs to wait on someone else because you've hit your limit of unbounded workers?

Yep, that seems to be the case. Increasing unbounded workers improved performance a lot! This is with the latest patch for hybrid async, which is enabled for all I/O except splice. IOSQE_ASYNC, IOSQE_ASYNC_HYBRID or no flag makes no difference for splice.

How does the sendfile(2) implementation handle this? Does it spawn a new thread for each splice call? It seems that, at least with uring right now, we need one unbound worker per concurrent splice if we don't want them waiting on each other.

As far as I know, sendfile() doesn't create any new thread to do its job; it does the whole job in the original context. The io_uring splice is not as fast as expected, and combining two splices into a sendfile makes it worse.
