[SPARK-XXXXX][CORE] Add IO_URING transport mode #55880

Draft

LuciferYang wants to merge 12 commits into apache:master from LuciferYang:iouring-transport

Contributor

LuciferYang commented May 14, 2026

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

LuciferYang added 3 commits

May 14, 2026 20:01


          [SPARK-XXXXX][CORE] Add IO_URING transport mode for Linux 5.10+

0b21463

### What changes were proposed in this pull request?

Enable Netty's io_uring native transport in Spark by:

1. Removing the `netty-transport-classes-io_uring` and
   `netty-transport-native-io_uring` exclusions from `netty-all` in the
   root `pom.xml`, and adding explicit `linux-x86_64` / `linux-aarch_64`
   classifier dependencies under `dependencyManagement`.
2. Mirroring the same native classifier dependencies in
   `common/network-common/pom.xml` and `core/pom.xml`.
3. Adding `IO_URING` to `IOMode` and wiring it into
   `NettyUtils.createEventLoop` / `getClientChannelClass` /
   `getServerChannelClass`. In `AUTO`, io_uring is preferred on Linux
   when `IoUring.isAvailable()` reports the running kernel supports it,
   then EPOLL, then KQUEUE on macOS, then NIO.
4. Adding `ShuffleNettyIoUringSuite` (gated on
   `Utils.isLinux && IoUring.isAvailable`) so the existing shuffle
   coverage exercises the new mode where the platform supports it.
5. Refreshing `NettyTransportBenchmark` comments so the AUTO behavior
   change is visible at the call sites; the existing `NIO vs AUTO`
   suites automatically exercise io_uring on Linux 5.10+.
6. Regenerating `dev/deps/spark-deps-hadoop-3-hive-2.3` to include the
   new `netty-transport-classes-io_uring` and
   `netty-transport-native-io_uring` (linux-x86_64 / linux-aarch_64 /
   linux-riscv64) entries.
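
The AUTO preference order from item 3 can be sketched as follows. This is a minimal stdlib-only illustration, not the code in `NettyUtils`: the `TransportMode` enum, the `AutoModeSketch` class, and the boolean parameters are hypothetical stand-ins for `IOMode` and for platform checks such as `IoUring.isAvailable()`.

```java
/** Hypothetical stand-in for Spark's IOMode; only the mode names come from this commit. */
enum TransportMode { NIO, EPOLL, KQUEUE, IO_URING }

final class AutoModeSketch {
  private AutoModeSketch() {}

  /**
   * Resolves AUTO in the order listed in item 3:
   * io_uring on Linux, then EPOLL on Linux, then KQUEUE on macOS, then NIO.
   * Each boolean is an assumed result of the corresponding platform check.
   */
  static TransportMode resolveAuto(
      boolean isLinux,
      boolean isMac,
      boolean ioUringAvailable,
      boolean epollAvailable,
      boolean kqueueAvailable) {
    if (isLinux && ioUringAvailable) {
      return TransportMode.IO_URING;
    }
    if (isLinux && epollAvailable) {
      return TransportMode.EPOLL;
    }
    if (isMac && kqueueAvailable) {
      return TransportMode.KQUEUE;
    }
    // NIO is the portable fallback when no native transport applies.
    return TransportMode.NIO;
  }
}
```

For example, a Linux host whose kernel lacks io_uring support but has epoll resolves to `EPOLL`, and a host that is neither Linux nor macOS resolves to `NIO`.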

### Why are the changes needed?

io_uring graduated from incubator to a first-class transport in Netty
4.2 (`io.netty.channel.uring`). Compared to EPOLL it batches I/O
operations through submission/completion queues, reducing per-op syscall
overhead on busy executors, and uses `IORING_OP_SPLICE` for `FileRegion`
writes -- functionally equivalent to `sendfile()` but fully
asynchronous. SPARK-56279 already updated `MessageEncoder` to emit the
header `ByteBuf` and the bare `DefaultFileRegion` separately when the
body is a `FileSegmentManagedBuffer`, which means the io_uring write
path can recognize the `DefaultFileRegion` and apply splice without any
additional Spark-side change.

### Does this PR introduce _any_ user-facing change?

Yes. On Linux kernels 5.10+, `spark.shuffle.io.mode=AUTO` (the default)
now selects io_uring instead of EPOLL when io_uring is available.
Operators who want the previous behavior can set
`spark.shuffle.io.mode=EPOLL` explicitly. A new explicit
`IO_URING` mode is also available.

### How was this patch tested?

- Manual SBT compile of `network-common`, `core`, `core/Test`,
  `network-shuffle`, and (with `-Pyarn`) `network-yarn` on macOS.
- `ShuffleNettyIoUringSuite` mirrors `ShuffleNettyEpollSuite` and runs
  the existing `ShuffleSuite` cases under `IO_URING` on Linux 5.10+ via
  GitHub Actions.
- macOS runs continue to take the KQUEUE path; Linux runs without
  io_uring kernel support fall back to EPOLL.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.x


          [SPARK-XXXXX][CORE][FOLLOWUP] Hide io_uring availability check behind NettyUtils helper

a03118b

Replace the `_root_.io.netty.channel.uring.IoUring.isAvailable` reference
in `ShuffleNettyIoUringSuite` with a small `NettyUtils.isIoUringAvailable()`
helper. The `_root_.` prefix was needed because `org.apache.spark.io`
shadows `io.netty.*` in this file's package, but it reads as unusual.
The helper keeps the Netty-specific class out of the test scope.


          [SPARK-XXXXX][CORE][FOLLOWUP] Shade io_uring native libs in YARN shuffle service

9e65d44

Add antrun `move` rules so the YARN external shuffle service relocates
the io_uring native libraries alongside the existing epoll/kqueue/tcnative
ones. Without this, the shaded `org.sparkproject.io.netty` classes look
for `liborg_sparkproject_netty_transport_native_io_uring42_*.so`, but
the unshaded files are named `libnetty_transport_native_io_uring42_*.so`,
so `IoUring.isAvailable()` returns `false` inside the YARN shuffle
service JVM and io_uring is silently unused.

Note: Netty 4.2 names the io_uring native lib `io_uring42_<arch>` (with
the major+minor version suffix to allow multiple Netty versions to
coexist), unlike epoll which uses the unsuffixed `epoll_<arch>`.

Verified with `build/mvn -pl common/network-yarn -am -Pyarn -DskipTests
package`: the resulting `spark-*-yarn-shuffle.jar` contains
`META-INF/native/liborg_sparkproject_netty_transport_native_io_uring42_{x86_64,aarch_64,riscv64}.so`.
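
The name mismatch described above can be illustrated with a small sketch. It assumes the relocated library name is formed by prepending the shading package prefix with dots replaced by underscores, which is consistent with the two file names quoted in this commit message. `ShadedNativeLibName` and its parameters are hypothetical and are not part of Netty or Spark.

```java
final class ShadedNativeLibName {
  private ShadedNativeLibName() {}

  /**
   * Builds the .so file name expected for a given shading prefix.
   *
   * @param shadePrefix package prefix added by shading, e.g. "org.sparkproject." (empty if unshaded)
   * @param libBaseName base library name, e.g. "netty_transport_native_io_uring42"
   * @param arch        architecture suffix, e.g. "x86_64"
   */
  static String fileName(String shadePrefix, String libBaseName, String arch) {
    // Assumption: dots in the package prefix become underscores in the file name.
    String mangledPrefix = shadePrefix.replace('.', '_');
    return "lib" + mangledPrefix + libBaseName + "_" + arch + ".so";
  }
}
```

With an empty prefix the sketch yields the unshaded `libnetty_transport_native_io_uring42_x86_64.so`; with `org.sparkproject.` it yields the `liborg_sparkproject_...` name that the antrun `move` rules need to produce.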

LuciferYang marked this pull request as draft

May 14, 2026 12:19

LuciferYang added 5 commits

May 14, 2026 23:32


          [SPARK-XXXXX][CORE][FOLLOWUP] Probe io_uring ring allocation for AUTO fallback

784c0a8

`IoUring.isAvailable()` only verifies that the JNI library loaded and
the basic syscalls work; it does not detect environments where the
kernel supports io_uring but `RLIMIT_MEMLOCK` is too low to actually
allocate the submission/completion queue rings. This is common in
containers, GitHub Actions runners, and other restricted environments,
and surfaces as:

    java.lang.IllegalStateException: failed to create a child event loop
    Caused by: java.lang.RuntimeException: failed to allocate memory for
      io_uring ring; try raising memlock limit (see getrlimit(RLIMIT_MEMLOCK,
      ...) or ulimit -l): Cannot allocate memory
        at io.netty.channel.uring.IoUringIoHandler.<init>(...)

After SPARK-XXXXX (the parent change) made AUTO prefer io_uring on
Linux, this caused unconditional failures in such environments rather
than graceful fallback to EPOLL.

This change adds a one-time JVM-wide probe in `NettyUtils` that creates
a single-thread `MultiThreadIoEventLoopGroup` with the io_uring handler
factory and shuts it down. If construction throws, the result is cached
as `false` and AUTO falls back to EPOLL. The probe is consulted by AUTO
mode and by `ShuffleNettyIoUringSuite.shouldRunTests`. An explicit
`IOMode.IO_URING` does not consult the probe and surfaces the
underlying error so users see what's wrong.

The previous `isIoUringAvailable()` helper (which just delegated to
`IoUring.isAvailable()`) is replaced by `isIoUringUsable()`, which
returns the probed result.
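
The "probe once, cache the result" pattern described in this commit might look roughly like the sketch below. It is stdlib-only and does not use Netty: `OneTimeProbe` is a hypothetical class, and the `attempt` callback stands in for constructing and shutting down the io_uring event loop group.

```java
import java.util.concurrent.Callable;

final class OneTimeProbe {
  private final Callable<?> attempt;
  // null until the first call; access is guarded by "this".
  private Boolean cached;

  OneTimeProbe(Callable<?> attempt) {
    this.attempt = attempt;
  }

  /** Runs the attempt at most once per instance and remembers whether it succeeded. */
  synchronized boolean isUsable() {
    if (cached == null) {
      try {
        attempt.call();
        cached = Boolean.TRUE;
      } catch (Throwable t) {
        // Any failure (for example, ring allocation failing) marks the transport unusable.
        cached = Boolean.FALSE;
      }
    }
    return cached;
  }
}
```

A caller in the position of AUTO mode would ask `isUsable()` and fall back when it returns `false`, while an explicit mode would skip the probe so the underlying exception reaches the user.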


          [SPARK-XXXXX][INFRA][FOLLOWUP] Raise RLIMIT_MEMLOCK in CI so io_uring is exercised

aecb9a1

The probe-based fallback added by the previous follow-up makes Spark
gracefully degrade to EPOLL when AUTO cannot allocate an io_uring ring
(low `RLIMIT_MEMLOCK`, common in containers and GitHub Actions runners).
Without raising the limit in CI, the io_uring code path would be
silently skipped on every PR and never exercised before release.

Add `sudo prlimit --pid $$ --memlock=unlimited:unlimited` (Linux-only,
fail-soft via `2>/dev/null || true` so it's a no-op on macOS/Windows
runners) at the top of:

  - `.github/workflows/build_and_test.yml` "Run tests" step, so module
    builds (yarn, core, network-shuffle, mllib, etc.) that hit
    `IOMode.AUTO` actually use io_uring on Linux 5.10+ and
    `ShuffleNettyIoUringSuite` runs instead of skipping via
    `NettyUtils.isIoUringUsable`.
  - `.github/workflows/benchmark.yml` "Run benchmarks" step, so
    `NettyTransportBenchmark`'s NIO-vs-AUTO comparison and
    file-backed shuffle suite measure io_uring rather than EPOLL.

The fail-soft is important: stock GHA Linux runners support sudo
prlimit, but stripped-down environments (e.g., custom containers used
by some matrix jobs) might not, and we don't want the CI step to fail
just because memlock could not be raised. The probe in `NettyUtils`
will then degrade to EPOLL as designed.


          [SPARK-XXXXX][INFRA][FOLLOWUP] Restore trailing whitespace on build_and_test.yml line 417

Inadvertently stripped by the previous CI commit's surrounding edit.
Pure whitespace; no behavioral change.


          Merge branch 'apache:master' into iouring-transport

7e3fbf0


          [SPARK-XXXXX][CORE][FOLLOWUP] Probe io_uring with worst-case thread count, not 1

aaec23e

The previous probe created a one-thread MultiThreadIoEventLoopGroup to
verify io_uring ring allocation works. This is insufficient in
environments (e.g., GHA Docker container jobs for pyspark) where the
container's RLIMIT_MEMLOCK is just large enough for one io_uring ring
but not the eight rings Spark allocates by default per event loop
group. The probe would succeed, AUTO would pick io_uring, then
TransportServer.init -> createEventLoop(numThreads=8) would crash
with `failed to allocate memory for io_uring ring` and propagate the
exception out of SparkContext construction.

Probe with MAX_DEFAULT_NETTY_THREADS rings instead. This matches the
worst-case allocation size Spark uses by default for a single event
loop group, so any environment whose memlock can't support real Spark
usage now correctly falls back to EPOLL at probe time.

Users who explicitly raise spark.shuffle.io.serverThreads (or the
analogous client/chunk-fetch knobs) above MAX_DEFAULT_NETTY_THREADS
remain responsible for ensuring their environment can support the
larger ring count; otherwise they should set spark.shuffle.io.mode
to EPOLL explicitly.

Observed in the pyspark CI matrix where runs sit inside a Docker
container that does not honor `sudo prlimit --memlock=unlimited` from
the workflow shell, leaving the JVM with the container's default
memlock.
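
Why a one-ring probe can pass while an eight-ring event loop group fails can be shown with a toy budget model. The sketch assumes each ring consumes a fixed amount of locked memory, which is a simplification of real kernel accounting; the class name and the byte values in the usage note are made up for illustration.

```java
final class RingBudgetSketch {
  private final long memlockBudgetBytes;
  private final long bytesPerRing;

  RingBudgetSketch(long memlockBudgetBytes, long bytesPerRing) {
    this.memlockBudgetBytes = memlockBudgetBytes;
    this.bytesPerRing = bytesPerRing;
  }

  /** Mimics a probe that allocates the given number of rings against the budget. */
  boolean probe(int rings) {
    long needed = Math.multiplyExact((long) rings, bytesPerRing);
    return needed <= memlockBudgetBytes;
  }
}
```

With a budget that fits three rings, `probe(1)` succeeds but `probe(8)` fails, which mirrors the failure mode in this commit: only a probe sized to the default worst case (eight rings) predicts whether the real event loop group can be created.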

Contributor Author

LuciferYang commented May 19, 2026

This Netty version has several io_uring issues that still need to be resolved in its next release.

LuciferYang and others added 4 commits

May 20, 2026 16:52


          [SPARK-XXXXX][CORE][FOLLOWUP] Do not auto-select io_uring; keep IO_URING opt-in only

b9fe8d5

io_uring has known incompatibilities with some Spark configurations -- most
notably `spark.authenticate.enableSaslEncryption=true`, where
`SaslEncryption.EncryptedMessage` violates the FileRegion `count()` upper-bound
assumption that io_uring's chunked-FileRegion fallback relies on. AUTO-selecting
io_uring would silently expose every such workload to corruption.

Restore AUTO to its prior behavior (EPOLL on Linux, KQUEUE on macOS, NIO
fallback). Users who want io_uring opt in explicitly via
`spark.shuffle.io.mode=IO_URING`. `isIoUringUsable()` is retained as a probe
helper for tests; AUTO no longer consults it.

Also update IOMode.AUTO Javadoc and NettyTransportBenchmark comments to match
the new behavior.
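
The restored behavior can be summarized in a short sketch: AUTO never resolves to io_uring, and an explicit request is honored as given. This is a simplified illustration that assumes the native transport for each platform is available; `IoModeSketch`, its nested `Mode` enum, and the boolean parameters are hypothetical stand-ins, not the actual `IOMode` or `NettyUtils` code.

```java
final class IoModeSketch {
  private IoModeSketch() {}

  /** Hypothetical mirror of the mode names used in this PR. */
  enum Mode { AUTO, NIO, EPOLL, KQUEUE, IO_URING }

  /**
   * Resolves the effective mode. Explicit modes, including IO_URING, pass through
   * unchanged; AUTO maps to EPOLL on Linux, KQUEUE on macOS, and NIO otherwise.
   */
  static Mode resolve(Mode requested, boolean isLinux, boolean isMac) {
    if (requested != Mode.AUTO) {
      return requested;
    }
    if (isLinux) {
      return Mode.EPOLL;
    }
    if (isMac) {
      return Mode.KQUEUE;
    }
    return Mode.NIO;
  }
}
```

Under this model the only way to obtain `IO_URING` is to request it, which corresponds to setting `spark.shuffle.io.mode=IO_URING`.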


          [SPARK-XXXXX][CORE][FOLLOWUP] Add IO_URING to NettyTransportBenchmark mode comparisons

bc0be07

Extend Suite 3 (IOMode Comparison) and Suite 8 (File-Backed Shuffle Block Fetch)
to also run an IO_URING case when NettyUtils.isIoUringUsable reports true. The
IO_URING case skips on macOS, on Linux without io_uring support, and on CI
runners with low RLIMIT_MEMLOCK; recorded results may therefore omit it, and
io_uring numbers should be captured by running locally on Linux 5.10+ with
prlimit --memlock=unlimited.


          Benchmark results for org.apache.spark.network.NettyTransportBenchmark (JDK 17, Scala 2.13, split 1 of 1)

f0e1f15


          [SPARK-XXXXX][CORE][FOLLOWUP] Log io_uring capabilities and per-module IO mode for diagnostics

da25ffe

Adds two one-shot info logs to help triage io_uring vs EPOLL performance reports,
particularly the shuffle-fetch regression visible in the NettyTransportBenchmark
results committed in f0e1f15:

* TransportContext ctor logs `(module, ioMode, role)` once per context so it is
  visible which Spark module ends up on IO_URING vs EPOLL vs NIO.

* NettyUtils.createEventLoop logs IoUring.featureString(), the kernel version,
  and /proc/sys/fs/pipe-max-size the first time IO_URING is selected in a process.
  The pipe-max-size matters because Netty's IoUringFileRegion routes FileRegion
  sends through a splice(2) pipe (file -> pipe -> socket); a small pipe forces
  more SQE/CQE round-trips per shuffle chunk.

No behavior change.
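
A one-shot diagnostic of the kind described above could be sketched as follows. This stdlib-only example covers only the pipe-max-size part and prints to standard output instead of using Spark's logger; `IoUringDiagnosticsSketch` and its method names are hypothetical, and the Netty feature string is deliberately left out.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.OptionalLong;
import java.util.concurrent.atomic.AtomicBoolean;

final class IoUringDiagnosticsSketch {
  private static final AtomicBoolean LOGGED = new AtomicBoolean(false);

  private IoUringDiagnosticsSketch() {}

  /** Parses the file content, e.g. "1048576\n"; empty if it is not a number. */
  static OptionalLong parsePipeMaxSize(String raw) {
    try {
      return OptionalLong.of(Long.parseLong(raw.trim()));
    } catch (NumberFormatException e) {
      return OptionalLong.empty();
    }
  }

  /** Reads the limit from the given path; empty if the file is missing or unreadable. */
  static OptionalLong readPipeMaxSize(Path path) {
    try {
      return parsePipeMaxSize(Files.readString(path, StandardCharsets.US_ASCII));
    } catch (IOException e) {
      return OptionalLong.empty();
    }
  }

  /** Emits the diagnostic line once per JVM; returns true only for the call that logged. */
  static boolean logOnce() {
    if (!LOGGED.compareAndSet(false, true)) {
      return false;
    }
    OptionalLong size = readPipeMaxSize(Path.of("/proc/sys/fs/pipe-max-size"));
    System.out.println("io_uring selected: kernel=" + System.getProperty("os.version")
        + ", pipe-max-size=" + (size.isPresent() ? String.valueOf(size.getAsLong()) : "unknown"));
    return true;
  }
}
```

The `compareAndSet` guard keeps the message to one line per process even when several event loop groups are created concurrently.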
