bench: io_uring-backed LocalFileSystem for dfbench (Linux) #21793
Dandandan wants to merge 2 commits into apache:main
Adds `UringLocalFileSystem`, an `ObjectStore` that routes byte-range reads through a dedicated io_uring driver thread and delegates all other operations to an inner `LocalFileSystem`. Wired into `CommonOpt::build_runtime` and registered for `file:///` by default on Linux; opt out with `DATAFUSION_IO_URING=0`. Non-Linux targets are unaffected: the module and the `io-uring` dep are both gated on `target_os = "linux"`.

Design:
- `UringSubmitter` owns an `mpsc::UnboundedSender<Cmd>`; any tokio task calls `submit_read` (sync) to enqueue an `IORING_OP_READ`, getting back a `oneshot::Receiver<Result<Bytes>>`.
- A dedicated `io-uring-driver` thread owns the `IoUring` instance. Each iteration drains the mpsc up to the submission queue's free slots, calls `submit_and_wait(1)` to flush SQEs and block for at least one CQ entry when work is outstanding, then drains CQ entries and fires the oneshots. When nothing is in flight it calls `blocking_recv()` for the next command instead of spinning.
- `get_ranges` submits all N ranges into the mpsc synchronously before awaiting any of them, so the driver sees the whole batch in one drain and hands the kernel all N at once. This gives the device queue depth > 1 for cold reads instead of the one-pread-at-a-time pattern that `LocalFileSystem::get_ranges` produces today.
- Buffers (`Box<[u8]>`) and the `Arc<File>` keeping the fd open are retained in the driver's `in_flight` map until the corresponding CQ entry arrives, so the kernel never writes into freed memory or a closed fd.

Rough edges (called out in module-level docs):
- No fd cache (one `open(2)` per `get_ranges` call, same as today).
- No registered buffers / `IORING_OP_READV`: one SQE per range, heap allocation per op.
- No `IORING_OP_ASYNC_CANCEL` on dropped-future cancellation; the submission runs to completion and its result is discarded.
- Metrics / tracing not yet plumbed in.

Intended as a first draft for benchmarking; good enough to A/B against the stock `LocalFileSystem` on ClickBench-style workloads.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
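The submit-everything-then-await shape described in the design notes can be illustrated with a small standard-library sketch. This is not the PR's code: it assumes `std::sync::mpsc` channels as stand-ins for tokio's mpsc and oneshot, an in-memory `Vec<u8>` in place of a file, and a plain slice copy in place of an io_uring read. The names `ReadCmd`, `Submitter`, `spawn_driver`, and this `get_ranges` are hypothetical.

```rust
use std::sync::mpsc::{self, Receiver, Sender};
use std::sync::Arc;
use std::thread;

type ReadResult = Result<Vec<u8>, String>;

/// One queued read: which bytes to copy and where to send the result.
struct ReadCmd {
    data: Arc<Vec<u8>>,        // stand-in for the Arc<File> kept alive per read
    offset: usize,
    len: usize,
    reply: Sender<ReadResult>, // stand-in for a oneshot sender
}

struct Submitter {
    tx: Sender<ReadCmd>,
}

impl Submitter {
    /// Synchronous: enqueues the read and returns a receiver immediately.
    fn submit_read(&self, data: Arc<Vec<u8>>, offset: usize, len: usize) -> Receiver<ReadResult> {
        let (reply, rx) = mpsc::channel();
        // If the driver is gone, the command (and its reply sender) is dropped,
        // so the caller's recv() fails instead of hanging.
        let _ = self.tx.send(ReadCmd { data, offset, len, reply });
        rx
    }
}

/// Spawns the driver thread and returns the handle used to submit reads.
fn spawn_driver() -> Submitter {
    let (tx, rx) = mpsc::channel::<ReadCmd>();
    thread::spawn(move || {
        // Block while idle; exit once every Submitter has been dropped.
        while let Ok(first) = rx.recv() {
            let mut batch = vec![first];
            // Drain whatever else is already queued so the batch is handled together.
            while let Ok(cmd) = rx.try_recv() {
                batch.push(cmd);
            }
            for cmd in batch {
                let result = match cmd.offset.checked_add(cmd.len) {
                    Some(end) if end <= cmd.data.len() => Ok(cmd.data[cmd.offset..end].to_vec()),
                    _ => Err("range out of bounds".to_string()),
                };
                let _ = cmd.reply.send(result); // the caller may have given up
            }
        }
    });
    Submitter { tx }
}

/// Enqueues every range first, then waits for the results in order.
fn get_ranges(s: &Submitter, data: &Arc<Vec<u8>>, ranges: &[(usize, usize)]) -> Vec<ReadResult> {
    let pending: Vec<Receiver<ReadResult>> = ranges
        .iter()
        .map(|&(offset, len)| s.submit_read(Arc::clone(data), offset, len))
        .collect();
    pending
        .into_iter()
        .map(|rx| rx.recv().unwrap_or_else(|_| Err("driver stopped".to_string())))
        .collect()
}

fn main() {
    let submitter = spawn_driver();
    let data = Arc::new((0u8..100).collect::<Vec<u8>>());
    let out = get_ranges(&submitter, &data, &[(0, 4), (10, 2), (98, 5)]);
    assert_eq!(out[0], Ok(vec![0, 1, 2, 3]));
    assert_eq!(out[1], Ok(vec![10, 11]));
    assert!(out[2].is_err()); // 98 + 5 runs past the end of the data
}
```

Because `submit_read` never waits, all ranges are queued before the first result is awaited, which is the property the design relies on to present the whole batch to the driver at once.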
run benchmarks

🤖 Benchmark running (GKE): comparing io-uring-local-fs (8844eb9) to 9a1ed57 (merge-base) using clickbench_partitioned.

🤖 Benchmark running (GKE): comparing io-uring-local-fs (8844eb9) to 9a1ed57 (merge-base) using tpcds.

Benchmark for this request failed.

🤖 Benchmark running (GKE): comparing io-uring-local-fs (8844eb9) to 9a1ed57 (merge-base) using tpch.

Benchmark for this request failed.

Benchmark for this request failed.
The `Debug` derive on `UringLocalFileSystem` transitively requires `Debug` on its fields; `UringSubmitter` is one of them, and tokio's `mpsc::UnboundedSender<T>` already impls `Debug` regardless of `T`, so a plain derive is enough.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
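The point about the derive can be shown with a minimal sketch. It assumes `std::sync::mpsc::Sender<T>` as a stand-in for tokio's `UnboundedSender<T>` (the standard-library sender also implements `Debug` without requiring `T: Debug`); the types `Cmd`, `Submitter`, and `Store` are hypothetical and only mirror the shape described in the commit message.

```rust
use std::sync::mpsc::{self, Sender};

/// Command type that deliberately has no `Debug` impl.
struct Cmd {
    _offset: u64,
}

/// Compiles because `Sender<Cmd>: Debug` holds even though `Cmd` is not `Debug`.
#[derive(Debug)]
struct Submitter {
    tx: Sender<Cmd>,
}

/// The outer type only needs `Debug` on its fields, so a plain derive works.
#[derive(Debug)]
struct Store {
    submitter: Submitter,
}

fn describe(store: &Store) -> String {
    format!("{:?}", store)
}

fn main() {
    let (tx, _rx) = mpsc::channel::<Cmd>();
    let store = Store { submitter: Submitter { tx } };
    println!("{}", describe(&store));
}
```

If the sender's `Debug` impl were bounded on `T: Debug`, the derive on `Submitter` would fail to compile until `Cmd` also derived `Debug`.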
run benchmarks

🤖 Benchmark running (GKE): comparing io-uring-local-fs (fbeeee2) to 9a1ed57 (merge-base) using tpch.

🤖 Benchmark running (GKE): comparing io-uring-local-fs (fbeeee2) to 9a1ed57 (merge-base) using clickbench_partitioned.

🤖 Benchmark running (GKE): comparing io-uring-local-fs (fbeeee2) to 9a1ed57 (merge-base) using tpcds.

🤖 Benchmark completed (GKE): tpcds, base (merge-base) vs. branch.

🤖 Benchmark completed (GKE): clickbench_partitioned, base (merge-base) vs. branch.

Benchmark for this request hit the 7200s job deadline before finishing.
Which issue does this PR close?

Rationale for this change

When profiling DataFusion's local parquet reads under ClickBench, `object_store::LocalFileSystem::get_ranges` serializes all range reads inside a single `spawn_blocking` task: one blocking thread, N sequential `seek + read_exact` pairs. On NVMe devices with meaningful queue-depth capability, and on cold-cache reads, this leaves a lot of parallelism unused: the kernel block layer would happily service many concurrent reads if we asked it to.

This PR adds a benchmark-only alternative `ObjectStore` (no changes to `datafusion/` or `object_store`) that routes the range reads through an `io_uring` submission queue, so N preads become N concurrent kernel-side operations. It's intended as a tool for A/B measurement rather than a production-quality replacement.

What changes are included in this PR?

- `benchmarks/src/util/uring_local_fs.rs` (new, ~480 lines): a `UringLocalFileSystem` implementing `object_store::ObjectStore`. It owns an `inner: LocalFileSystem` for non-read ops and a dedicated `io-uring-driver` OS thread that owns the `IoUring` instance and the submission/completion loop.
- `benchmarks/src/util/mod.rs`: registers the module under `#[cfg(target_os = "linux")]`.
- `benchmarks/src/util/options.rs`: `CommonOpt::build_runtime` registers `UringLocalFileSystem` for `file:///` by default on Linux, with `DATAFUSION_IO_URING=0` as the opt-out. Layers with `--simulate-latency` as expected (`LatencyObjectStore` wraps the uring store).
- `benchmarks/Cargo.toml`: `io-uring = "0.7"` added under `[target.'cfg(target_os = "linux")'.dependencies]`, so non-Linux targets don't pull it in.

Driver shape:

- `submit_read(Arc<File>, offset, len)` is a sync fn that sends a `Cmd::Read` over an mpsc and returns a `oneshot::Receiver<io::Result<Bytes>>`. This is sync on purpose: `get_ranges` enqueues all N ranges before awaiting any of them, so the driver sees the whole batch in one `try_recv` drain.
- The driver calls `submit_and_wait(1)` to flush and block for at least one completion when work is outstanding, then drains the CQ and fires the oneshots. It idles with `blocking_recv()` when empty.
- Buffers (`Box<[u8]>`) and the keep-alive `Arc<File>` live in the driver's `in_flight` map until the corresponding CQ entry arrives, so the kernel never writes into freed memory or a closed fd.

Known rough edges (documented in the module header):

- No fd cache: one `open(2)` per `get_ranges` call (same as today).
- No registered buffers / `IORING_OP_READV`: one SQE per range, heap allocation per op.
- No `IORING_OP_ASYNC_CANCEL` on dropped-future cancellation; the submission runs to completion and its result is discarded.

Not included in this PR: any change to `object_store` or `datafusion` core, or any production path. All non-Linux users get the stock `LocalFileSystem` via the existing cfg-gated code.

Are these changes tested?

- `cargo check -p datafusion-benchmarks` and `cargo clippy -p datafusion-benchmarks --all-targets -- -D warnings` pass on macOS (Linux module is cfg-d out).
- `./target/release-nonlto/dfbench clickbench --iterations 3 --path <hits_partitioned> --queries-path benchmarks/queries/clickbench/queries` with and without `DATAFUSION_IO_URING=0` is the expected first validation.

Are there any user-facing changes?

Only within `dfbench` on Linux:

- A `Using io_uring-backed LocalFileSystem` message is emitted so it's visible which backend is in effect.
- `DATAFUSION_IO_URING=0` in the environment restores the stock `LocalFileSystem`.

No API changes. No changes to any crate that downstream users depend on.
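The default-on, opt-out behaviour described above might be expressed roughly as follows. This is a sketch under assumptions, not the code in `options.rs`: the `Backend` enum and `choose_backend` function are hypothetical, and it assumes that only the exact value `0` disables the io_uring store, since the description does not say how other values are treated.

```rust
use std::env;

/// Name of the opt-out variable mentioned in the PR description.
const ENV_VAR: &str = "DATAFUSION_IO_URING";

/// Hypothetical label for which store ends up registered for `file:///`.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Backend {
    Uring,
    Stock,
}

/// Picks the backend from the platform and the raw environment value.
/// Assumption: only the exact string "0" opts out.
fn choose_backend(is_linux: bool, env_value: Option<&str>) -> Backend {
    match (is_linux, env_value) {
        (false, _) => Backend::Stock,        // non-Linux targets never use io_uring
        (true, Some("0")) => Backend::Stock, // explicit opt-out
        (true, _) => Backend::Uring,         // default on Linux
    }
}

fn main() {
    let value = env::var(ENV_VAR).ok();
    let backend = choose_backend(cfg!(target_os = "linux"), value.as_deref());
    println!("backend: {:?}", backend);
}
```

Keeping the decision in a pure function that takes the platform flag and the raw value makes the opt-out logic easy to check without touching the process environment.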