io_uring: switch server-socket read path to multishot recv #44669

Draft
aburan28 wants to merge 4 commits into envoyproxy:main from aburan28:multishot-recv/03-server-socket

aburan28 commented Apr 27, 2026

Commit Message:
io_uring: switch server-socket read path to multishot recv

Switches IoUringServerSocket's read path from a per-read readv +
heap-allocated uint8_t[] to a single multishot IORING_OP_RECV SQE that
pulls buffers from a kernel-managed buf-ring. Each completion hands the
upper layer one kernel-selected buffer wrapped in a BufferFragmentImpl
whose release callback recycles it back to the ring.

The current readv path has two costs on every read: a
make_unique<uint8_t[]>(size) (and a paired delete[]), and a fresh SQE
submission. With multishot + provided buffers the kernel keeps the SQE armed
across many recv completions and pulls from a pre-registered buffer pool —
eliminating both costs on the hot path.

Mechanics:

  • New Request::RequestType::RecvMultishot distinguishes the multishot SQE
    so the worker's completion dispatch knows to keep the Request* alive
    while IORING_CQE_F_MORE is set (the kernel reuses the same user_data for
    further completions).
  • IoUringSocket::onRead virtual gains a uint32_t flags argument carrying
    the raw cqe->flags. The buffer ID is in the upper bits when
    IORING_CQE_F_BUFFER is set; F_MORE indicates the SQE is still armed.
  • IoUringServerSocket::onRead only clears read_req_ when the SQE has
    terminated. While armed the bottom-of-function submitReadRequest
    short-circuits because read_req_ is still non-null. When F_MORE
    clears, read_req_ is freed and a new multishot SQE is submitted.
  • IoUringWorkerImpl::makeMultishotBufferFragment wraps the kernel buffer
    with a release callback that calls recycleBuffer — buffer return is
    driven by the upper-layer drain.
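
To make the flag handling above concrete, here is a minimal sketch of how the raw cqe->flags value could be decoded. The three constants repeat the values from <linux/io_uring.h> so the snippet compiles without kernel headers; DecodedCqeFlags and decodeCqeFlags are hypothetical names used only for illustration and are not taken from this PR.

```cpp
#include <cassert>
#include <cstdint>

// Values mirror <linux/io_uring.h>; repeated here so the sketch is standalone.
constexpr uint32_t kCqeFBuffer = 1U << 0;  // IORING_CQE_F_BUFFER
constexpr uint32_t kCqeFMore = 1U << 1;    // IORING_CQE_F_MORE
constexpr uint32_t kCqeBufferShift = 16;   // IORING_CQE_BUFFER_SHIFT

// Hypothetical helper type, not part of this PR.
struct DecodedCqeFlags {
  bool has_buffer;     // F_BUFFER was set, so buffer_id is meaningful.
  uint16_t buffer_id;  // Buffer ID taken from the upper 16 bits of flags.
  bool still_armed;    // F_MORE was set, so the SQE will complete again.
};

inline DecodedCqeFlags decodeCqeFlags(uint32_t flags) {
  DecodedCqeFlags out;
  out.has_buffer = (flags & kCqeFBuffer) != 0;
  out.buffer_id = 0;
  if (out.has_buffer) {
    out.buffer_id = static_cast<uint16_t>(flags >> kCqeBufferShift);
  }
  out.still_armed = (flags & kCqeFMore) != 0;
  return out;
}
```

A completion that selected buffer 7 and stays armed would carry flags equal to (7 << 16) | F_BUFFER | F_MORE; a completion with flags == 0 has no buffer and terminates the SQE.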

Depends on:

Additional Description:
The worker constructor gains two new defaulted args
(enable_multishot_recv = false, multishot_recv_buffer_count = 256) so all
existing call sites compile unchanged. When multishot is requested but the
kernel/liburing doesn't support it, setupBufRing returns Failed and the
worker silently falls back to readv.
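
The fallback described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the worker's actual code: FakeIoUring, ReadMode, chooseReadMode, the group ID and the buffer size are all hypothetical, and the local IoUringResult enum is a stand-in for the real type. Only the two defaulted argument names and their defaults come from the description.

```cpp
#include <cassert>
#include <cstdint>

enum class IoUringResult { Ok, Failed };  // Local stand-in for the real enum.
enum class ReadMode { Readv, RecvMultishot };

// Hypothetical stand-in for the ring; the flag simulates kernel/liburing support.
struct FakeIoUring {
  bool buf_ring_supported;
  IoUringResult setupBufRing(uint16_t /*group_id*/, uint32_t /*count*/, uint32_t /*buf_size*/) {
    return buf_ring_supported ? IoUringResult::Ok : IoUringResult::Failed;
  }
};

// Mirrors the two defaulted constructor args: off by default, 256 buffers.
inline ReadMode chooseReadMode(FakeIoUring& ring, bool enable_multishot_recv = false,
                               uint32_t multishot_recv_buffer_count = 256) {
  constexpr uint16_t kGroupId = 0;     // Assumed group ID for the single buf-ring.
  constexpr uint32_t kBufSize = 8192;  // Assumed per-buffer size.
  if (!enable_multishot_recv) {
    return ReadMode::Readv;  // Default: existing call sites see no change.
  }
  if (ring.setupBufRing(kGroupId, multishot_recv_buffer_count, kBufSize) ==
      IoUringResult::Failed) {
    return ReadMode::Readv;  // Silent fallback when buf-ring setup fails.
  }
  return ReadMode::RecvMultishot;
}
```

Because both arguments are defaulted, a caller that passes only the ring keeps the readv path, which is what lets existing call sites compile and behave as before.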

The proto config + factory wiring to actually expose this option lives in
#44670. Until that lands, the readv path is the only one used in production.

AI usage disclosure: Portions of the code and/or PR description were drafted
with the assistance of Claude (Anthropic). I reviewed and understand all
submitted code.

Risk Level: Medium
(Materially different read path for io_uring server sockets, but defaulted
off via worker constructor args and not yet reachable from proto config.
With the default, no behavior changes.)

Testing:

  • MultishotRecvSetupAndSubmit — buf-ring setup + first submit picks the
    multishot path and produces a RecvMultishot request.
  • MultishotRecvFallbackOnUnsupportedKernel — when setupBufRing fails,
    the worker falls back to prepareReadv.
  • MultishotRecvDeliversBufferAndStaysArmed — completion with
    F_BUFFER | F_MORE delivers the buffer and does not re-arm.
  • MultishotRecvReArmOnFMoreClear — completion with F_BUFFER but no
    F_MORE triggers a fresh prepareRecvMultishot.
  • Existing io_uring unit and integration tests on Linux CI.

Docs Changes: N/A. New onRead flags arg is documented inline in
envoy/common/io/io_uring.h.

Release Notes: N/A (read path change is unreachable from configuration until
#44670 lands; the public-facing release note belongs there.)

Platform Specific Features:
io_uring is Linux-only. Multishot recv requires kernel 6.0+, and buf-ring
registration requires 5.19+; on kernels without buf-ring support the worker
falls back to readv via the setupBufRing failure path.
No platform support change beyond existing io_uring build gating.

Runtime guard: N/A in this PR — the new path is gated by a constructor arg
that defaults to off and is unreachable from configuration. The config-level
gate (and any release-note runtime guard) lives in #44670.

This is a no-behavior-change preparation step for multishot recv. The
``CompletionCb`` callback type now takes a ``uint32_t flags`` argument
that carries the raw ``cqe->flags`` value from the kernel.

For multishot completions a follow-up change will inspect:
* ``IORING_CQE_F_BUFFER`` — a buffer was selected from a buf-ring; the
  buffer ID is encoded in the upper bits.
* ``IORING_CQE_F_MORE`` — the SQE will produce further completions.

The worker callback ignores ``flags`` for now. Injected completions are
defined to always carry ``flags == 0``.

All ``forEveryCompletion`` callers (worker, impl tests) updated.
``IoUringSocket::on*`` virtual methods are intentionally unchanged in
this commit; only ``onRead`` will need flags, in the multishot recv
change.
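
As a rough illustration of the callback change in this commit, the sketch below passes a ``flags`` value through a completion loop and forces it to zero for injected completions. The argument order of ``CompletionCb`` and the ``CompletionQueueSketch`` class are assumptions made for the example; only the ``uint32_t flags`` argument and the ``flags == 0`` rule for injected completions come from the text above.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

struct Request {};  // Placeholder for the real request type.

// Assumed shape of the callback; the real argument order may differ.
using CompletionCb =
    std::function<void(Request* user_data, int32_t result, bool injected, uint32_t flags)>;

// Hypothetical queue used only to illustrate the flags plumbing.
class CompletionQueueSketch {
public:
  // A kernel completion carries the raw cqe->flags value.
  void pushKernelCompletion(Request* req, int32_t result, uint32_t flags) {
    pending_.push_back({req, result, false, flags});
  }
  // Injected completions are defined to always carry flags == 0.
  void injectCompletion(Request* req, int32_t result) {
    pending_.push_back({req, result, true, 0});
  }
  void forEveryCompletion(const CompletionCb& cb) {
    for (const Entry& e : pending_) {
      cb(e.req, e.result, e.injected, e.flags);
    }
    pending_.clear();
  }

private:
  struct Entry {
    Request* req;
    int32_t result;
    bool injected;
    uint32_t flags;
  };
  std::vector<Entry> pending_;
};
```

A callback that does not care about the new argument can simply ignore it, which matches the worker's behavior in this commit.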

Signed-off-by: Adam Buran <a.buran28@gmail.com>
Signed-off-by: Adam Buran <aburan28@gmail.com>
Adds the kernel-managed buffer ring lifecycle and the ``recv`` multishot
opcode to ``IoUringImpl``. This is the plumbing layer for switching the
io_uring socket read path off the per-read ``readv`` allocation; the
worker change comes in a follow-up PR.

New ``IoUring`` virtuals:

* ``setupBufRing(group_id, count, buf_size)`` — register a buffer ring
  with the kernel. The buffers live in a single contiguous allocation
  owned by ``IoUringImpl``. Validates that ``count`` is a non-zero power
  of two and rejects double-setup. Falls back to ``IoUringResult::Failed``
  on kernels that lack ``IORING_REGISTER_PBUF_RING`` (< 5.19).
* ``prepareRecvMultishot(fd, group_id, user_data)`` — submits a recv
  with ``IOSQE_BUFFER_SELECT`` so the kernel pulls a buffer from the
  ring. The same SQE may produce multiple completions, signalled by
  ``IORING_CQE_F_MORE`` in ``cqe->flags``.
* ``getBufferForBid(group_id, bid)`` — look up the storage backing a
  particular kernel-selected buffer; the consumer reads up to ``cqe->res``
  bytes and then recycles.
* ``recycleBuffer(group_id, bid)`` — return a consumed buffer to the
  ring so the kernel can reuse it.

For now only one buf-ring is supported per ``IoUring`` instance.
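
The user-space side of the buffer ring can be sketched as below: input validation, a single contiguous allocation, and lookup by buffer ID. This is a simplified illustration that leaves out the kernel registration entirely; ``BufRingStorage`` and its method names are hypothetical and do not reflect the actual ``IoUringImpl`` members.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical storage class; kernel registration is intentionally omitted.
class BufRingStorage {
public:
  // Returns false for a zero or non-power-of-two count, a zero buf_size,
  // or a second setup attempt.
  bool setup(uint32_t count, uint32_t buf_size) {
    const bool power_of_two = count != 0 && (count & (count - 1)) == 0;
    if (!power_of_two || buf_size == 0 || storage_ != nullptr) {
      return false;
    }
    // One contiguous allocation backs every buffer in the ring.
    storage_ = std::make_unique<uint8_t[]>(static_cast<std::size_t>(count) * buf_size);
    count_ = count;
    buf_size_ = buf_size;
    return true;
  }

  // Maps a kernel-selected buffer ID to its slice of the allocation.
  uint8_t* bufferForBid(uint16_t bid) {
    if (storage_ == nullptr || bid >= count_) {
      return nullptr;
    }
    return storage_.get() + static_cast<std::size_t>(bid) * buf_size_;
  }

private:
  std::unique_ptr<uint8_t[]> storage_;
  uint32_t count_ = 0;
  uint32_t buf_size_ = 0;
};
```

With a contiguous allocation, buffer ``bid`` starts at offset ``bid * buf_size``, so the lookup is a bounds check plus pointer arithmetic.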

Test:
* ``SetupBufRingValidatesInputs`` — exercises the rejection paths
  (bad count, bad buf_size, double-setup).
* ``MultishotRecvDeliversBuffersAndStaysArmed`` — end-to-end with a real
  socketpair and a real ring: arm a multishot recv, write twice,
  verify both completions deliver buffers, the bid is in range, the
  data matches, and the SQE stays armed (F_MORE set on the first
  completion). Skips when the kernel lacks buf-ring support.

Signed-off-by: Adam Buran <a.buran28@gmail.com>
Signed-off-by: Adam Buran <aburan28@gmail.com>
When the worker is configured with multishot recv enabled and the
kernel/liburing successfully sets up a buf-ring (5.19+), the
``IoUringServerSocket`` read path replaces the per-read
``readv`` SQE + ``uint8_t[]`` allocation with a single
``IORING_OP_RECV`` multishot SQE that pulls buffers from the kernel-
managed ring. Each completion delivers one kernel-selected buffer; the
``BufferFragment`` wrapping it recycles the buffer back to the ring on
release.
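
The recycle-on-release idea can be sketched with a small wrapper whose release callback returns the buffer to the ring. This is a simplified stand-in: ``FragmentSketch`` fires the callback from its destructor, whereas the real ``BufferFragment`` release mechanism may be triggered differently, and ``FakeRing`` only records which buffer IDs were recycled. Both types are hypothetical; only the ``makeMultishotBufferFragment`` and ``recycleBuffer`` names come from this change, and their signatures here are assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Simplified stand-in for a buffer fragment with a release callback.
class FragmentSketch {
public:
  FragmentSketch(const uint8_t* data, std::size_t size, std::function<void()> on_release)
      : data_(data), size_(size), on_release_(std::move(on_release)) {}
  FragmentSketch(const FragmentSketch&) = delete;
  FragmentSketch& operator=(const FragmentSketch&) = delete;
  // Releasing the fragment runs the callback exactly once.
  ~FragmentSketch() {
    if (on_release_) {
      on_release_();
    }
  }
  const uint8_t* data() const { return data_; }
  std::size_t size() const { return size_; }

private:
  const uint8_t* data_;
  std::size_t size_;
  std::function<void()> on_release_;
};

// Hypothetical ring that records recycled buffer IDs instead of talking to the kernel.
struct FakeRing {
  std::vector<uint16_t> recycled_bids;
  void recycleBuffer(uint16_t /*group_id*/, uint16_t bid) { recycled_bids.push_back(bid); }
};

// Wraps a kernel-selected buffer; the ring must outlive the fragment.
inline std::unique_ptr<FragmentSketch> makeMultishotBufferFragment(FakeRing& ring,
                                                                   uint16_t group_id, uint16_t bid,
                                                                   const uint8_t* data,
                                                                   std::size_t len) {
  return std::make_unique<FragmentSketch>(
      data, len, [&ring, group_id, bid]() { ring.recycleBuffer(group_id, bid); });
}
```

The buffer is not returned to the ring while the fragment is alive, so a slow consumer holds buffers and the pool can run dry until it drains.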

Mechanics:

* New ``Request::RequestType::RecvMultishot`` distinguishes the
  multishot SQE from a plain ``Read``. The worker's completion dispatch
  routes both to ``onRead`` but holds onto the ``Request*`` while
  ``IORING_CQE_F_MORE`` is set (the kernel reuses the same user_data
  for further completions on the same SQE).
* ``IoUringSocket::onRead`` gains a ``uint32_t flags`` argument carrying
  the raw ``cqe->flags``. The buffer ID is in the upper bits when
  ``IORING_CQE_F_BUFFER`` is set; ``F_MORE`` indicates the SQE is still
  armed.
* ``IoUringServerSocket::onRead`` only clears ``read_req_`` when the
  SQE has terminated. While armed, the bottom-of-function
  ``submitReadRequest`` short-circuits because ``read_req_`` is still
  non-null. When ``F_MORE`` clears, ``read_req_`` is freed and a new
  multishot SQE is submitted.
* ``IoUringWorkerImpl::makeMultishotBufferFragment`` wraps the kernel
  buffer with a release callback that calls ``recycleBuffer`` —
  back-pressure / buffer return is driven by the upper-layer drain.
* On older kernels ``setupBufRing`` returns ``Failed`` and the worker
  silently falls back to the existing ``readv`` path, so the feature
  is safe to ship gated behind a config flag.
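
The arm and re-arm behavior of ``onRead`` can be sketched as a small state machine. ``ServerSocketSketch`` is a hypothetical reduction that keeps only ``read_req_`` and a submission counter; the real ``IoUringServerSocket`` does considerably more, and the ``onRead`` signature here is trimmed to the ``flags`` argument.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

constexpr uint32_t kCqeFMore = 1U << 1;  // IORING_CQE_F_MORE from <linux/io_uring.h>

struct Request {};  // Placeholder for the real request type.

// Hypothetical reduction of the socket's read-side state.
class ServerSocketSketch {
public:
  // Submits a multishot recv only when no request is outstanding.
  void submitReadRequest() {
    if (read_req_ != nullptr) {
      return;  // Still armed: short-circuit.
    }
    read_req_ = std::make_unique<Request>();
    ++submissions_;
  }

  // Called once per completion with the raw cqe->flags value.
  void onRead(uint32_t flags) {
    if ((flags & kCqeFMore) == 0) {
      read_req_.reset();  // SQE terminated: free the request.
    }
    submitReadRequest();  // No-op while armed, re-arms after termination.
  }

  int submissions() const { return submissions_; }

private:
  std::unique_ptr<Request> read_req_;
  int submissions_ = 0;
};
```

With this shape, a stream of completions carrying ``F_MORE`` costs no extra submissions, and the first completion without it triggers exactly one new SQE.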

The worker constructor gains two new defaulted args
(``enable_multishot_recv``, ``multishot_recv_buffer_count``) so all
existing call sites continue to compile unchanged.

Tests:

* ``MultishotRecvSetupAndSubmit`` — buf-ring setup + first submit picks
  the multishot path and produces a ``RecvMultishot`` request.
* ``MultishotRecvFallbackOnUnsupportedKernel`` — when ``setupBufRing``
  fails, the worker falls back to ``prepareReadv``.
* ``MultishotRecvDeliversBufferAndStaysArmed`` — completion with
  ``F_BUFFER | F_MORE`` delivers the buffer to the upper layer and does
  not re-arm the SQE; the buffer is recycled when the upper layer
  drains.
* ``MultishotRecvReArmOnFMoreClear`` — completion with ``F_BUFFER``
  but no ``F_MORE`` triggers a fresh ``prepareRecvMultishot`` to re-arm.

The proto / factory wiring to actually expose this option is in a
follow-up change.

Signed-off-by: Adam Buran <a.buran28@gmail.com>
Signed-off-by: Adam Buran <aburan28@gmail.com>
@repokitteh-read-only

Hi @aburan28, welcome and thank you for your contribution.

We will try to review your Pull Request as quickly as possible.

In the meantime, please take a look at the contribution guidelines if you have not done so already.

🐱

Caused by: #44669 was opened by aburan28.

see: more, trace.

@aburan28 aburan28 had a problem deploying to external-contributors April 27, 2026 03:12 — with GitHub Actions Error
@repokitteh-read-only

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #44669 was opened by aburan28.

see: more, trace.

@aburan28 aburan28 marked this pull request as ready for review May 4, 2026 22:58
zuercher (Member) commented May 5, 2026

Let's mark this as a draft until the PR it depends on is merged.

Signed-off-by: Adam Buran <aburan28@gmail.com>
@aburan28 aburan28 requested a deployment to external-contributors May 10, 2026 22:15 — with GitHub Actions Waiting
@aburan28 aburan28 marked this pull request as draft May 10, 2026 23:41