[SPARK-XXXXX][CORE] Add IO_URING transport mode #55880
Draft
LuciferYang wants to merge 12 commits into
### What changes were proposed in this pull request?

Enable Netty's io_uring native transport in Spark by:

1. Removing the `netty-transport-classes-io_uring` and `netty-transport-native-io_uring` exclusions from `netty-all` in the root `pom.xml`, and adding explicit `linux-x86_64` / `linux-aarch_64` classifier dependencies under `dependencyManagement`.
2. Mirroring the same native classifier dependencies in `common/network-common/pom.xml` and `core/pom.xml`.
3. Adding `IO_URING` to `IOMode` and wiring it into `NettyUtils.createEventLoop` / `getClientChannelClass` / `getServerChannelClass`. In `AUTO`, io_uring is preferred on Linux when `IoUring.isAvailable()` reports the running kernel supports it, then EPOLL, then KQUEUE on macOS, then NIO.
4. Adding `ShuffleNettyIoUringSuite` (gated on `Utils.isLinux && IoUring.isAvailable`) so the existing shuffle coverage exercises the new mode where the platform supports it.
5. Refreshing `NettyTransportBenchmark` comments so the AUTO behavior change is visible at the call sites; the existing `NIO vs AUTO` suites automatically exercise io_uring on Linux 5.10+.
6. Regenerating `dev/deps/spark-deps-hadoop-3-hive-2.3` to include the new `netty-transport-classes-io_uring` and `netty-transport-native-io_uring` (linux-x86_64 / linux-aarch_64 / linux-riscv64) entries.

### Why are the changes needed?

io_uring graduated from incubator to a first-class transport in Netty 4.2 (`io.netty.channel.uring`). Compared to EPOLL it batches I/O operations through submission/completion queues, reducing per-op syscall overhead on busy executors, and uses `IORING_OP_SPLICE` for `FileRegion` writes -- functionally equivalent to `sendfile()` but fully asynchronous.

SPARK-56279 already updated `MessageEncoder` to emit the header `ByteBuf` and the bare `DefaultFileRegion` separately when the body is a `FileSegmentManagedBuffer`, which means the io_uring write path can recognize the `DefaultFileRegion` and apply splice without any additional Spark-side change.

### Does this PR introduce _any_ user-facing change?

Yes. On Linux kernels 5.10+, `spark.shuffle.io.mode=AUTO` (the default) now selects io_uring instead of EPOLL when io_uring is available. Operators who want the previous behavior can set `spark.shuffle.io.mode=EPOLL` explicitly. A new explicit `IO_URING` mode is also available.

### How was this patch tested?

- Manual SBT compile of `network-common`, `core`, `core/Test`, `network-shuffle`, and (with `-Pyarn`) `network-yarn` on macOS.
- `ShuffleNettyIoUringSuite` mirrors `ShuffleNettyEpollSuite` and runs the existing `ShuffleSuite` cases under `IO_URING` on Linux 5.10+ via GitHub Actions.
- macOS runs continue to take the KQUEUE path; Linux runs without io_uring kernel support fall back to EPOLL.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.x
… NettyUtils helper
Replace the `_root_.io.netty.channel.uring.IoUring.isAvailable` reference in `ShuffleNettyIoUringSuite` with a small `NettyUtils.isIoUringAvailable()` helper. The `_root_.` prefix was needed because `org.apache.spark.io` shadows `io.netty.*` in this file's package, but it reads as unusual. The helper keeps the Netty-specific class out of the test scope.
…fle service
Add antrun `move` rules so the YARN external shuffle service relocates
the io_uring native libraries alongside the existing epoll/kqueue/tcnative
ones. Without this, the shaded `org.sparkproject.io.netty` classes look
for `liborg_sparkproject_netty_transport_native_io_uring42_*.so`, but
the unshaded files are named `libnetty_transport_native_io_uring42_*.so`,
so `IoUring.isAvailable()` returns `false` inside the YARN shuffle
service JVM and io_uring is silently unused.
Note: Netty 4.2 names the io_uring native lib `io_uring42_<arch>` (with
the major+minor version suffix to allow multiple Netty versions to
coexist), unlike epoll which uses the unsuffixed `epoll_<arch>`.
Verified with `build/mvn -pl common/network-yarn -am -Pyarn -DskipTests
package`: the resulting `spark-*-yarn-shuffle.jar` contains
`META-INF/native/liborg_sparkproject_netty_transport_native_io_uring42_{x86_64,aarch_64,riscv64}.so`.
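The file-name mismatch described above can be illustrated with a small sketch. This is not the antrun rule itself and not Netty's loader code: `ShadedNativeLibName` and `shaded` are hypothetical names, and the mangling rule (dots in the shade prefix become underscores and are inserted after `lib`) is inferred from the two file names quoted in this commit message.

```java
/** Hypothetical helper showing how a shade prefix changes the expected native library file name. */
final class ShadedNativeLibName {

  /**
   * Returns the file name a shaded loader would look for, assuming the shade
   * prefix (for example "org.sparkproject.") is mangled by replacing '.' with '_'
   * and inserted right after the leading "lib".
   */
  static String shaded(String shadePrefix, String unshadedFileName) {
    if (!unshadedFileName.startsWith("lib")) {
      throw new IllegalArgumentException("expected a name starting with 'lib': " + unshadedFileName);
    }
    String mangled = shadePrefix.replace('.', '_');
    return "lib" + mangled + unshadedFileName.substring("lib".length());
  }

  public static void main(String[] args) {
    // The unshaded name is the one quoted in the commit message above.
    String unshaded = "libnetty_transport_native_io_uring42_x86_64.so";
    System.out.println(shaded("org.sparkproject.", unshaded));
  }
}
```

If the jar only contains the unshaded name, a lookup for the mangled name finds nothing, which is consistent with the silent `IoUring.isAvailable() == false` behavior the commit describes.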
… fallback
`IoUring.isAvailable()` only verifies that the JNI library loaded and
the basic syscalls work; it does not detect environments where the
kernel supports io_uring but `RLIMIT_MEMLOCK` is too low to actually
allocate the submission/completion queue rings. This is common in
containers, GitHub Actions runners, and other restricted environments,
and surfaces as:
java.lang.IllegalStateException: failed to create a child event loop
Caused by: java.lang.RuntimeException: failed to allocate memory for
io_uring ring; try raising memlock limit (see getrlimit(RLIMIT_MEMLOCK,
...) or ulimit -l): Cannot allocate memory
at io.netty.channel.uring.IoUringIoHandler.<init>(...)
After SPARK-XXXXX (the parent change) made AUTO prefer io_uring on
Linux, this caused unconditional failures in such environments rather
than graceful fallback to EPOLL.
This change adds a one-time JVM-wide probe in `NettyUtils` that creates
a single-thread `MultiThreadIoEventLoopGroup` with the io_uring handler
factory and shuts it down. If construction throws, the result is cached
as `false` and AUTO falls back to EPOLL. The probe is consulted by AUTO
mode and by `ShuffleNettyIoUringSuite.shouldRunTests`. An explicit
`IOMode.IO_URING` does not consult the probe and surfaces the
underlying error so users see what's wrong.
The previous `isIoUringAvailable()` helper (which just delegated to
`IoUring.isAvailable()`) is replaced by `isIoUringUsable()`, which
returns the probed result.
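The probe-and-cache behavior described in this commit can be sketched roughly as follows. This is a minimal, Netty-free illustration under stated assumptions, not the code in this PR: `IoUringProbe`, `RingFactory`, and `tryCreateRings` are hypothetical names, and the real probe builds and shuts down a `MultiThreadIoEventLoopGroup` where this sketch calls a stand-in interface.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Caches the outcome of a one-time "can we actually allocate io_uring rings?" probe. */
final class IoUringProbe {

  /** Hypothetical stand-in for "create an io_uring event loop group, then shut it down". */
  interface RingFactory {
    void tryCreateRings(int numRings) throws Exception;
  }

  private final RingFactory factory;
  private final int numRings;
  private Boolean usable; // null until the probe has run; guarded by `this`

  IoUringProbe(RingFactory factory, int numRings) {
    this.factory = factory;
    this.numRings = numRings;
  }

  /** Runs the probe at most once; any failure is cached as "not usable". */
  synchronized boolean isUsable() {
    if (usable == null) {
      try {
        factory.tryCreateRings(numRings);
        usable = true;
      } catch (Exception | LinkageError e) {
        usable = false;
      }
    }
    return usable;
  }

  public static void main(String[] args) {
    AtomicInteger attempts = new AtomicInteger();
    IoUringProbe probe = new IoUringProbe(n -> {
      attempts.incrementAndGet();
      throw new IllegalStateException("failed to allocate memory for io_uring ring");
    }, 1);
    System.out.println("usable=" + probe.isUsable());
    System.out.println("usable=" + probe.isUsable());
    System.out.println("probe attempts=" + attempts.get());
  }
}
```

Caching the negative result matters as much as caching the positive one: without it, every event loop group creation in a restricted container would repeat the failing allocation.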
… is exercised
The probe-based fallback added by the previous follow-up makes Spark
gracefully degrade to EPOLL when AUTO cannot allocate an io_uring ring
(low `RLIMIT_MEMLOCK`, common in containers and GitHub Actions runners).
Without raising the limit in CI, the io_uring code path would be
silently skipped on every PR and never exercised before release.
Add `sudo prlimit --pid $$ --memlock=unlimited:unlimited` (Linux-only,
fail-soft via `2>/dev/null || true` so it's a no-op on macOS/Windows
runners) at the top of:
- `.github/workflows/build_and_test.yml` "Run tests" step, so module
builds (yarn, core, network-shuffle, mllib, etc.) that hit
`IOMode.AUTO` actually use io_uring on Linux 5.10+ and
`ShuffleNettyIoUringSuite` runs instead of skipping via
`NettyUtils.isIoUringUsable`.
- `.github/workflows/benchmark.yml` "Run benchmarks" step, so
`NettyTransportBenchmark`'s NIO-vs-AUTO comparison and
file-backed shuffle suite measure io_uring rather than EPOLL.
The fail-soft is important: stock GHA Linux runners support sudo
prlimit, but stripped-down environments (e.g., custom containers used
by some matrix jobs) might not, and we don't want the CI step to fail
just because memlock could not be raised. The probe in `NettyUtils`
will then degrade to EPOLL as designed.
…nd_test.yml line 417
Inadvertently stripped by the previous CI commit's surrounding edit. Pure whitespace; no behavioral change.
…ount, not 1

The previous probe created a one-thread MultiThreadIoEventLoopGroup to verify io_uring ring allocation works. This is insufficient in environments (e.g., GHA Docker container jobs for pyspark) where the container's RLIMIT_MEMLOCK is just large enough for one io_uring ring but not the eight rings Spark allocates by default per event loop group. The probe would succeed, AUTO would pick io_uring, then TransportServer.init -> createEventLoop(numThreads=8) would crash with `failed to allocate memory for io_uring ring` and propagate the exception out of SparkContext construction.

Probe with MAX_DEFAULT_NETTY_THREADS rings instead. This matches the worst-case allocation size Spark uses by default for a single event loop group, so any environment whose memlock can't support real Spark usage now correctly falls back to EPOLL at probe time.

Users who explicitly raise spark.shuffle.io.serverThreads (or the analogous client/chunk-fetch knobs) above MAX_DEFAULT_NETTY_THREADS remain responsible for ensuring their environment can support the larger ring count; otherwise they should set spark.shuffle.io.mode to EPOLL explicitly.

Observed in the pyspark CI matrix where runs sit inside a Docker container that does not honor `sudo prlimit --memlock=unlimited` from the workflow shell, leaving the JVM with the container's default memlock.
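The reasoning in this commit (a one-ring probe passes, the eight-ring allocation then fails) can be modeled with a toy simulation. Everything here is an assumption made for illustration: `RingCountProbe`, `canAllocate`, and `startServer` are hypothetical, and memlock is modeled as a simple "maximum number of rings" rather than a byte limit.

```java
/** Toy model of why the probe must request as many rings as real usage does. */
final class RingCountProbe {

  /** Rings a default event loop group allocates, per the commit message above. */
  static final int DEFAULT_THREADS = 8;

  /** Hypothetical model: the environment can lock memory for at most `ringsMemlockAllows` rings. */
  static boolean canAllocate(int ringsMemlockAllows, int ringsRequested) {
    return ringsRequested <= ringsMemlockAllows;
  }

  /**
   * Simulates AUTO-style startup as described for this commit: probe first,
   * then allocate DEFAULT_THREADS rings if the probe passed.
   */
  static String startServer(int ringsMemlockAllows, int probeRings) {
    if (!canAllocate(ringsMemlockAllows, probeRings)) {
      return "EPOLL"; // graceful fallback at probe time
    }
    return canAllocate(ringsMemlockAllows, DEFAULT_THREADS) ? "IO_URING" : "CRASH";
  }

  public static void main(String[] args) {
    // Memlock fits one ring: a one-ring probe passes, the real allocation fails.
    System.out.println("probe=1 -> " + startServer(1, 1));
    // Probing with the default thread count catches the problem up front.
    System.out.println("probe=8 -> " + startServer(1, DEFAULT_THREADS));
  }
}
```

The same model shows the stated limitation: if a user raises the thread count above the probed ring count, the probe can again pass while the real allocation fails.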
Contributor
Author
There are several issues in this Netty version that need to be resolved in the next release.
…ING opt-in only

io_uring has known incompatibilities with some Spark configurations -- most notably `spark.authenticate.enableSaslEncryption=true`, where `SaslEncryption.EncryptedMessage` violates the FileRegion `count()` upper-bound assumption that io_uring's chunked-FileRegion fallback relies on. AUTO-selecting io_uring would silently expose every such workload to corruption.

Restore AUTO to its prior behavior (EPOLL on Linux, KQUEUE on macOS, NIO fallback). Users who want io_uring opt in explicitly via `spark.shuffle.io.mode=IO_URING`. `isIoUringUsable()` is retained as a probe helper for tests; AUTO no longer consults it.

Also update IOMode.AUTO Javadoc and NettyTransportBenchmark comments to match the new behavior.
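The restored AUTO resolution order can be sketched as below. This is a simplified stand-in, not `NettyUtils` itself: `IoModeResolver` and `resolve` are hypothetical, the enum only mirrors the mode names mentioned in this PR, and the availability flags are plain booleans where the real code would ask Netty.

```java
import java.util.Locale;

/** Sketch of opt-in io_uring: AUTO never resolves to IO_URING. */
final class IoModeResolver {

  /** Mirrors the mode names used in this PR; not the real Spark enum. */
  enum IOMode { NIO, EPOLL, KQUEUE, IO_URING, AUTO }

  /**
   * Explicit modes are returned unchanged. AUTO resolves to EPOLL on Linux,
   * KQUEUE on macOS, and NIO otherwise (or when the native transport is unavailable).
   */
  static IOMode resolve(IOMode requested, String osName,
                        boolean epollAvailable, boolean kqueueAvailable) {
    if (requested != IOMode.AUTO) {
      return requested; // includes IO_URING: errors surface instead of being masked
    }
    String os = osName.toLowerCase(Locale.ROOT);
    if (os.contains("linux") && epollAvailable) {
      return IOMode.EPOLL;
    }
    if (os.contains("mac") && kqueueAvailable) {
      return IOMode.KQUEUE;
    }
    return IOMode.NIO;
  }

  public static void main(String[] args) {
    String os = System.getProperty("os.name", "unknown");
    // Availability flags are hard-coded here purely for demonstration.
    System.out.println("AUTO on " + os + " -> " + resolve(IOMode.AUTO, os, true, true));
    System.out.println("IO_URING on " + os + " -> " + resolve(IOMode.IO_URING, os, true, true));
  }
}
```

Keeping IO_URING out of the AUTO branch means a workload using SASL encryption cannot reach the io_uring FileRegion path unless the operator selects it explicitly.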
… mode comparisons
Extend Suite 3 (IOMode Comparison) and Suite 8 (File-Backed Shuffle Block Fetch) to also run an IO_URING case when NettyUtils.isIoUringUsable reports true. The IO_URING case skips on macOS, on Linux without io_uring support, and on CI runners with low RLIMIT_MEMLOCK; recorded results may therefore omit it, and io_uring numbers should be captured by running locally on Linux 5.10+ with prlimit --memlock=unlimited.
…k (JDK 17, Scala 2.13, split 1 of 1)
…e IO mode for diagnostics

Adds two one-shot info logs to help triage io_uring vs EPOLL performance reports, particularly the shuffle-fetch regression visible in the NettyTransportBenchmark results committed in f0e1f15:

* TransportContext ctor logs `(module, ioMode, role)` once per context so it is visible which Spark module ends up on IO_URING vs EPOLL vs NIO.
* NettyUtils.createEventLoop logs IoUring.featureString(), the kernel version, and /proc/sys/fs/pipe-max-size the first time IO_URING is selected in a process. The pipe-max-size matters because Netty's IoUringFileRegion routes FileRegion sends through a splice(2) pipe (file -> pipe -> socket); a small pipe forces more SQE/CQE round-trips per shuffle chunk.

No behavior change.
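A once-per-process diagnostic of this kind might look roughly like the sketch below. It is an illustration only: `IoUringDiagnostics`, `logOnce`, and `readOrUnknown` are hypothetical names, it prints to stdout rather than using Spark's logger, it omits the Netty feature string, and it assumes the JVM's `os.version` property is an acceptable stand-in for the kernel version on Linux.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.atomic.AtomicBoolean;

/** Emits a single diagnostic line the first time IO_URING is selected in a process. */
final class IoUringDiagnostics {

  private static final AtomicBoolean LOGGED = new AtomicBoolean(false);

  /** Reads a small text file, returning "unknown" if it is missing or unreadable. */
  static String readOrUnknown(Path path) {
    try {
      return new String(Files.readAllBytes(path), StandardCharsets.UTF_8).trim();
    } catch (Exception e) {
      return "unknown"; // e.g. on macOS, where /proc does not exist
    }
  }

  /** Returns true only for the call that actually logged. */
  static boolean logOnce() {
    if (!LOGGED.compareAndSet(false, true)) {
      return false;
    }
    String kernel = System.getProperty("os.version", "unknown");
    String pipeMax = readOrUnknown(Paths.get("/proc/sys/fs/pipe-max-size"));
    System.out.println("IO_URING selected: kernel=" + kernel + ", pipe-max-size=" + pipeMax);
    return true;
  }

  public static void main(String[] args) {
    logOnce(); // prints the diagnostic line
    logOnce(); // no output: the flag is already set
  }
}
```

Using `compareAndSet` keeps the log to one line even when several event loop groups are created concurrently.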