Conversation


@AdamGS AdamGS commented Jan 29, 2026

Which issue does this PR close?

Closes #622.

Rationale for this change

Saves at least one seek syscall per range read.

What changes are included in this PR?

  1. On Windows and Unix, make use of the platform-specific FileExt::read_at functions.
  2. Unifies the read implementations used in LocalStore::read_ranges and for local files in GetResult::bytes, so this codepath is shared across both implementations, which are currently very similar with only one minor difference.

Are there any user-facing changes?

No
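For context, the platform-specific positional reads this PR switches to can be sketched like this (an illustrative sketch, not the PR's actual code; the helper name `read_at_abs` is made up):

```rust
use std::fs::File;
use std::io;

// On Unix this is a single pread(2); the file cursor is never moved.
#[cfg(unix)]
fn read_at_abs(file: &File, buf: &mut [u8], offset: u64) -> io::Result<usize> {
    use std::os::unix::fs::FileExt;
    file.read_at(buf, offset)
}

// On Windows, seek_read is the closest equivalent (note: it may move the cursor).
#[cfg(windows)]
fn read_at_abs(file: &File, buf: &mut [u8], offset: u64) -> io::Result<usize> {
    use std::os::windows::fs::FileExt;
    file.seek_read(buf, offset)
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("read_at_demo.bin");
    std::fs::write(&path, b"hello world")?;
    let file = File::open(&path)?;
    let mut buf = [0u8; 5];
    let n = read_at_abs(&file, &mut buf, 6)?;
    println!("{}", std::str::from_utf8(&buf[..n]).unwrap());
    std::fs::remove_file(&path)?;
    Ok(())
}
```

The seek-then-read pattern this replaces needs two syscalls per range and a mutable cursor, which also forces serialization of concurrent reads on the same file handle.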


AdamGS commented Jan 29, 2026

Running the following benchmark, I see a pretty solid improvement on my MacBook Pro, but I'll try to find the time to benchmark it on some AWS box.

Criterion results (compared to most recent main as of this writing):

read_ranges/local_fs/10 time:   [23.699 µs 23.957 µs 24.288 µs]
                        thrpt:  [3.1412 GiB/s 3.1847 GiB/s 3.2192 GiB/s]
                 change:
                        time:   [−7.8715% −5.3115% −1.1817%] (p = 0.00 < 0.05)
                        thrpt:  [+1.1958% +5.6094% +8.5440%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
read_ranges/local_fs/100
                        time:   [86.424 µs 87.314 µs 88.361 µs]
                        thrpt:  [8.6343 GiB/s 8.7379 GiB/s 8.8278 GiB/s]
                 change:
                        time:   [−25.457% −24.045% −22.676%] (p = 0.00 < 0.05)
                        thrpt:  [+29.326% +31.658% +34.150%]
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  6 (6.00%) high mild
  7 (7.00%) high severe
Benchmarking read_ranges/local_fs/1000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.2s, enable flat sampling, or reduce sample count to 60.
read_ranges/local_fs/1000
                        time:   [984.16 µs 996.91 µs 1.0117 ms]
                        thrpt:  [7.5414 GiB/s 7.6530 GiB/s 7.7522 GiB/s]
                 change:
                        time:   [−22.534% −20.385% −18.156%] (p = 0.00 < 0.05)
                        thrpt:  [+22.183% +25.604% +29.088%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high severe

get_opts_whole_file/local_fs
                        time:   [7.8658 ms 7.8872 ms 7.9107 ms]
                        thrpt:  [7.9007 GiB/s 7.9242 GiB/s 7.9458 GiB/s]
                 change:
                        time:   [−1.6425% −1.2214% −0.8096%] (p = 0.00 < 0.05)
                        thrpt:  [+0.8162% +1.2365% +1.6699%]
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

get_range_sequential/local_fs/10
                        time:   [206.23 µs 209.19 µs 212.59 µs]
                        thrpt:  [367.50 MiB/s 373.46 MiB/s 378.82 MiB/s]
                 change:
                        time:   [−15.312% −14.118% −12.898%] (p = 0.00 < 0.05)
                        thrpt:  [+14.808% +16.439% +18.081%]
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  8 (8.00%) high mild
  3 (3.00%) high severe
get_range_sequential/local_fs/100
                        time:   [2.0626 ms 2.0773 ms 2.0948 ms]
                        thrpt:  [372.95 MiB/s 376.08 MiB/s 378.77 MiB/s]
                 change:
                        time:   [−14.729% −13.884% −12.938%] (p = 0.00 < 0.05)
                        thrpt:  [+14.861% +16.123% +17.273%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
get_range_sequential/local_fs/1000
                        time:   [23.157 ms 23.363 ms 23.596 ms]
                        thrpt:  [331.09 MiB/s 334.39 MiB/s 337.37 MiB/s]
                 change:
                        time:   [−1.0249% +0.0131% +1.2791%] (p = 0.98 > 0.05)
                        thrpt:  [−1.2630% −0.0131% +1.0356%]
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe

Code for the benchmark, mostly doing 8 KB reads and some full-file reads (64 MB); happy to add more cases that might be interesting:

use criterion::{BenchmarkId, Criterion, Throughput, criterion_group, criterion_main};
use object_store::local::LocalFileSystem;
use object_store::path::Path;
use object_store::{ObjectStore, ObjectStoreExt};
use rand::Rng;
use std::ops::Range;
use tempfile::TempDir;

const FILE_SIZE: u64 = 64 * 1024 * 1024; // 64 MB
const RANGE_SIZE: u64 = 8 * 1024; // 8 KB ranges

fn generate_random_ranges(file_size: u64, range_size: u64, count: usize) -> Vec<Range<u64>> {
    let mut rng = rand::rng();
    (0..count)
        .map(|_| {
            let start = rng.random_range(0..file_size - range_size);
            start..start + range_size
        })
        .collect()
}

fn bench_read_ranges(c: &mut Criterion) {
    let rt = tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .unwrap();

    // Set up the test file
    let temp_dir = TempDir::new().unwrap();
    let store = LocalFileSystem::new_with_prefix(temp_dir.path()).unwrap();
    let path = Path::from("bench_file");

    // Create a file with deterministic patterned data (i % 256)
    let data: Vec<u8> = (0..FILE_SIZE).map(|i| (i % 256) as u8).collect();
    rt.block_on(async {
        store.put(&path, data.into()).await.unwrap();
    });

    let mut group = c.benchmark_group("read_ranges");

    for num_ranges in [10, 100, 1000] {
        let ranges = generate_random_ranges(FILE_SIZE, RANGE_SIZE, num_ranges);
        let total_bytes = num_ranges as u64 * RANGE_SIZE;

        group.throughput(Throughput::Bytes(total_bytes));
        group.bench_with_input(
            BenchmarkId::new("local_fs", num_ranges),
            &ranges,
            |b, ranges| {
                b.to_async(&rt)
                    .iter(|| async { store.get_ranges(&path, ranges).await.unwrap() });
            },
        );
    }

    group.finish();
}

fn bench_get_opts_whole_file(c: &mut Criterion) {
    let rt = tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .unwrap();

    let temp_dir = TempDir::new().unwrap();
    let store = LocalFileSystem::new_with_prefix(temp_dir.path()).unwrap();
    let path = Path::from("bench_file");

    let data: Vec<u8> = (0..FILE_SIZE).map(|i| (i % 256) as u8).collect();
    rt.block_on(async {
        store.put(&path, data.into()).await.unwrap();
    });

    let mut group = c.benchmark_group("get_opts_whole_file");
    group.throughput(Throughput::Bytes(FILE_SIZE));

    group.bench_function("local_fs", |b| {
        b.to_async(&rt).iter(|| async {
            store
                .get_opts(&path, Default::default())
                .await
                .unwrap()
                .bytes()
                .await
                .unwrap()
        });
    });

    group.finish();
}

fn bench_get_range_sequential(c: &mut Criterion) {
    let rt = tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .unwrap();

    let temp_dir = TempDir::new().unwrap();
    let store = LocalFileSystem::new_with_prefix(temp_dir.path()).unwrap();
    let path = Path::from("bench_file");

    let data: Vec<u8> = (0..FILE_SIZE).map(|i| (i % 256) as u8).collect();
    rt.block_on(async {
        store.put(&path, data.into()).await.unwrap();
    });

    let mut group = c.benchmark_group("get_range_sequential");

    for num_ranges in [10, 100, 1000] {
        let ranges = generate_random_ranges(FILE_SIZE, RANGE_SIZE, num_ranges);
        let total_bytes = num_ranges as u64 * RANGE_SIZE;

        group.throughput(Throughput::Bytes(total_bytes));
        group.bench_with_input(
            BenchmarkId::new("local_fs", num_ranges),
            &ranges,
            |b, ranges| {
                b.to_async(&rt).iter(|| async {
                    for range in ranges {
                        store.get_range(&path, range.clone()).await.unwrap();
                    }
                });
            },
        );
    }

    group.finish();
}

criterion_group!(
    benches,
    bench_read_ranges,
    bench_get_opts_whole_file,
    bench_get_range_sequential
);
criterion_main!(benches);

.map_err(|source| local::Error::UnableToReadBytes { source, path })?;

Ok(buffer.into())

use crate::local::read_range;
@AdamGS (Contributor Author) commented:

this makes me think that maybe read_range belongs elsewhere, maybe a new file module?

let mut buf_slice = &mut buf[..];
let mut offset = range.start;

while !buf_slice.is_empty() {
@AdamGS (Contributor Author) commented:

This is roughly the implementation of read_exact_at; it just allows us to keep the same errors and to track how many bytes we've read for the OutOfRange error branch.
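The loop described here can be sketched as follows (a Unix-only sketch; `read_full_at` is a made-up name standing in for the PR's inlined loop, and the real code maps errors to the crate's own types):

```rust
use std::fs::File;
use std::io::{self, ErrorKind};
use std::os::unix::fs::FileExt; // Unix-only for brevity; Windows would use seek_read

/// Fill `buf` starting at `offset`, looping over short reads.
/// Returns how many bytes were actually read (may be < buf.len() at EOF),
/// which mirrors the need to report progress for an OutOfRange-style error.
fn read_full_at(file: &File, buf: &mut [u8], offset: u64) -> io::Result<usize> {
    let mut total = 0;
    while total < buf.len() {
        match file.read_at(&mut buf[total..], offset + total as u64) {
            Ok(0) => break, // EOF: stop and report how far we got
            Ok(n) => total += n,
            Err(e) if e.kind() == ErrorKind::Interrupted => continue, // retry on EINTR
            Err(e) => return Err(e),
        }
    }
    Ok(total)
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("read_full_at_demo.bin");
    std::fs::write(&path, b"abcdefghij")?;
    let file = File::open(&path)?;
    let mut buf = vec![0u8; 5];
    let n = read_full_at(&file, &mut buf, 2)?;
    println!("{} {}", n, std::str::from_utf8(&buf[..n]).unwrap());
    std::fs::remove_file(&path)?;
    Ok(())
}
```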

src/local.rs Outdated
Comment on lines 1000 to 1006
// Safety:
// Setting the buffer's length to its capacity is safe as it remains within its allocated memory.
// In cases where `read_exact_at` errors, the contents of the buffer are undefined,
// but we discard it without using it.
unsafe {
buf.set_len(to_read as usize);
}
@AdamGS (Contributor Author) commented:

This is the first use of unsafe in the crate, which I'm not thrilled about. The alternative seems to be to pre-fill it with Vec::resize, but that seems wasteful.

A reviewer commented:

I think the later &mut buf[..] makes this UB.
What is the performance when initializing buf with vec![0u8; 8]?
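A sketch of the safe, zero-initialized alternative being asked about (Unix-only; `read_range_zeroed` is an illustrative name, not the PR's code):

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;

// Safe alternative to `unsafe { buf.set_len(...) }`: zero-fill the buffer
// first, so handing out `&mut buf[..]` is sound. Costs one memset per range.
fn read_range_zeroed(file: &File, start: u64, to_read: usize) -> io::Result<Vec<u8>> {
    let mut buf = vec![0u8; to_read]; // fully initialized up front
    file.read_exact_at(&mut buf, start)?;
    Ok(buf)
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("read_range_zeroed_demo.bin");
    std::fs::write(&path, b"0123456789")?;
    let file = File::open(&path)?;
    let buf = read_range_zeroed(&file, 3, 4)?;
    println!("{}", std::str::from_utf8(&buf).unwrap());
    std::fs::remove_file(&path)?;
    Ok(())
}
```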

@Dandandan commented Jan 29, 2026:

The alternative would be reimplementing/vendoring part of this API https://doc.rust-lang.org/std/os/unix/fs/trait.FileExt.html#method.read_buf_at

@Dandandan commented Jan 29, 2026:

Or a suggestion by Gemini (raw file IO):

use std::os::unix::io::AsRawFd;

let fd = file.as_raw_fd();
let mut buf: Vec<u8> = Vec::with_capacity(to_read);

let bytes_read = unsafe {
    // libc::pread takes a raw pointer (*mut c_void), avoiding the reference-creation UB
    libc::pread(fd, buf.as_mut_ptr() as *mut _, to_read, offset as i64)
};

if bytes_read >= 0 {
    unsafe { buf.set_len(bytes_read as usize) };
} else {
    // Handle error
}


@AdamGS (Contributor Author) commented:

I'll try it out with pre-allocated data

@AdamGS (Contributor Author) commented:

Just preallocated the buffer with zeros (buf.resize(to_read as usize, 0_u8)); results:

read_ranges/local_fs/10 time:   [24.271 µs 24.605 µs 24.979 µs]
                        thrpt:  [3.0543 GiB/s 3.1008 GiB/s 3.1434 GiB/s]
                 change:
                        time:   [−9.2694% −7.9576% −6.6172%] (p = 0.00 < 0.05)
                        thrpt:  [+7.0861% +8.6456% +10.216%]
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe
read_ranges/local_fs/100
                        time:   [91.591 µs 92.250 µs 92.981 µs]
                        thrpt:  [8.2054 GiB/s 8.2703 GiB/s 8.3299 GiB/s]
                 change:
                        time:   [−19.988% −19.050% −18.121%] (p = 0.00 < 0.05)
                        thrpt:  [+22.131% +23.532% +24.982%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  5 (5.00%) high severe
read_ranges/local_fs/1000
                        time:   [877.00 µs 891.02 µs 907.75 µs]
                        thrpt:  [8.4047 GiB/s 8.5626 GiB/s 8.6994 GiB/s]
                 change:
                        time:   [−32.377% −30.518% −28.468%] (p = 0.00 < 0.05)
                        thrpt:  [+39.797% +43.921% +47.879%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe

get_opts_whole_file/local_fs
                        time:   [8.2127 ms 8.2487 ms 8.2891 ms]
                        thrpt:  [7.5400 GiB/s 7.5770 GiB/s 7.6101 GiB/s]
                 change:
                        time:   [−3.7223% −2.3673% −1.0502%] (p = 0.00 < 0.05)
                        thrpt:  [+1.0613% +2.4247% +3.8663%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) high mild
  9 (9.00%) high severe

get_range_sequential/local_fs/10
                        time:   [223.74 µs 225.81 µs 228.25 µs]
                        thrpt:  [342.28 MiB/s 345.97 MiB/s 349.18 MiB/s]
                 change:
                        time:   [−8.4366% −7.0562% −5.8343%] (p = 0.00 < 0.05)
                        thrpt:  [+6.1958% +7.5919% +9.2139%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
get_range_sequential/local_fs/100
                        time:   [2.2283 ms 2.2480 ms 2.2706 ms]
                        thrpt:  [344.07 MiB/s 347.54 MiB/s 350.61 MiB/s]
                 change:
                        time:   [−8.9263% −7.8699% −6.7098%] (p = 0.00 < 0.05)
                        thrpt:  [+7.1924% +8.5421% +9.8012%]
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  6 (6.00%) high mild
  3 (3.00%) high severe
get_range_sequential/local_fs/1000
                        time:   [23.699 ms 23.843 ms 24.011 ms]
                        thrpt:  [325.37 MiB/s 327.66 MiB/s 329.65 MiB/s]
                 change:
                        time:   [+1.8221% +2.7832% +3.8109%] (p = 0.00 < 0.05)
                        thrpt:  [−3.6710% −2.7079% −1.7895%]
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low mild
  6 (6.00%) high severe

@AdamGS (Contributor Author) commented:

Maybe a bit worse for get_opts but still an improvement; otherwise very similar, within the noise of a laptop.

@AdamGS AdamGS changed the title [WIP] Use platform specific read_at when available Use platform specific read_at when available Jan 29, 2026
@AdamGS AdamGS marked this pull request as ready for review January 29, 2026 19:22
Before:

// extents of the file
if range.start >= file_len {

After:

// extents of the file, or if it's at the end of the file and wants to read a non-empty range.
if range.start > file_len || (range.start == file_len && !range.is_empty()) {

A reviewer commented:

Isn't the added condition equivalent to range.end > file_len?

@AdamGS (Contributor Author) commented:

From lines 994-995:

// Don't read past end of file
let to_read = range.end.min(file_len) - range.start;

I think it's actually technically OK to read with a range that ends past the end of the file (I think HTTP Range headers allow that too); the only issue is trying to start at the end and read more. I'll try to figure out whether this error is actually tested, and add a test if not.
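Putting the clamping from lines 994-995 together with the new bounds check, the behaviour under discussion can be sketched as (illustrative; `clamp_range` is a made-up helper, not the PR's code):

```rust
use std::ops::Range;

// A range may *end* past EOF (it gets clamped, like an HTTP Range request),
// but it may not *start* past EOF, and starting exactly at EOF is only
// valid for an empty range. On success, returns the number of bytes to read.
fn clamp_range(range: &Range<u64>, file_len: u64) -> Result<u64, String> {
    if range.start > file_len || (range.start == file_len && !range.is_empty()) {
        return Err(format!(
            "range {:?} out of bounds for file of length {}",
            range, file_len
        ));
    }
    // Don't read past end of file
    Ok(range.end.min(file_len) - range.start)
}

fn main() {
    assert_eq!(clamp_range(&(0..10), 100).unwrap(), 10);
    assert_eq!(clamp_range(&(90..200), 100).unwrap(), 10); // end clamped to EOF
    assert_eq!(clamp_range(&(100..100), 100).unwrap(), 0); // empty range at EOF is ok
    assert!(clamp_range(&(100..108), 100).is_err()); // non-empty read starting at EOF
    println!("ok");
}
```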

A reviewer replied:

Ok


AdamGS commented Jan 29, 2026

Seems like I need to do some more work on the Windows side.



Development

Successfully merging this pull request may close these issues.

LocalFileSystem: use read_at instead of seek + read
