Conversation


@AdamGS AdamGS commented Jan 29, 2026

Which issue does this PR close?

Closes #622.

Rationale for this change

Saves at least one seek syscall per range read.

What changes are included in this PR?

  1. On Windows and Unix, make use of the platform-specific FileExt::read_at functions.
  2. Unifies the read implementations used in LocalStore::read_ranges and for local files in GetResult::bytes, so this codepath is shared across both implementations, which are currently very similar with only one minor difference.

Are there any user-facing changes?

No
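For context, the platform-specific positional reads this PR switches to can be sketched like this (an illustrative sketch, not the PR's actual code; the helper name `read_at_abs` is made up):

```rust
use std::fs::File;
use std::io;

// On Unix this is a single pread(2); the file cursor is never moved.
#[cfg(unix)]
fn read_at_abs(file: &File, buf: &mut [u8], offset: u64) -> io::Result<usize> {
    use std::os::unix::fs::FileExt;
    file.read_at(buf, offset)
}

// On Windows, seek_read is the closest equivalent (note: it may move the cursor).
#[cfg(windows)]
fn read_at_abs(file: &File, buf: &mut [u8], offset: u64) -> io::Result<usize> {
    use std::os::windows::fs::FileExt;
    file.seek_read(buf, offset)
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("read_at_demo.bin");
    std::fs::write(&path, b"hello world")?;
    let file = File::open(&path)?;
    let mut buf = [0u8; 5];
    let n = read_at_abs(&file, &mut buf, 6)?;
    println!("{}", std::str::from_utf8(&buf[..n]).unwrap());
    std::fs::remove_file(&path)?;
    Ok(())
}
```

The seek-then-read pattern this replaces needs two syscalls per range and a mutable cursor, which also forces serialization of concurrent reads on the same file handle.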


AdamGS commented Jan 29, 2026

Running the following benchmark, I see a pretty solid improvement on my MacBook Pro, but I'll try to find the time to benchmark it on some AWS box.

Criterion results (compared to most recent main as of this writing):

read_ranges/local_fs/10 time:   [23.699 µs 23.957 µs 24.288 µs]
                        thrpt:  [3.1412 GiB/s 3.1847 GiB/s 3.2192 GiB/s]
                 change:
                        time:   [−7.8715% −5.3115% −1.1817%] (p = 0.00 < 0.05)
                        thrpt:  [+1.1958% +5.6094% +8.5440%]
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
read_ranges/local_fs/100
                        time:   [86.424 µs 87.314 µs 88.361 µs]
                        thrpt:  [8.6343 GiB/s 8.7379 GiB/s 8.8278 GiB/s]
                 change:
                        time:   [−25.457% −24.045% −22.676%] (p = 0.00 < 0.05)
                        thrpt:  [+29.326% +31.658% +34.150%]
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  6 (6.00%) high mild
  7 (7.00%) high severe
Benchmarking read_ranges/local_fs/1000: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.2s, enable flat sampling, or reduce sample count to 60.
read_ranges/local_fs/1000
                        time:   [984.16 µs 996.91 µs 1.0117 ms]
                        thrpt:  [7.5414 GiB/s 7.6530 GiB/s 7.7522 GiB/s]
                 change:
                        time:   [−22.534% −20.385% −18.156%] (p = 0.00 < 0.05)
                        thrpt:  [+22.183% +25.604% +29.088%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high severe

get_opts_whole_file/local_fs
                        time:   [7.8658 ms 7.8872 ms 7.9107 ms]
                        thrpt:  [7.9007 GiB/s 7.9242 GiB/s 7.9458 GiB/s]
                 change:
                        time:   [−1.6425% −1.2214% −0.8096%] (p = 0.00 < 0.05)
                        thrpt:  [+0.8162% +1.2365% +1.6699%]
                        Change within noise threshold.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

get_range_sequential/local_fs/10
                        time:   [206.23 µs 209.19 µs 212.59 µs]
                        thrpt:  [367.50 MiB/s 373.46 MiB/s 378.82 MiB/s]
                 change:
                        time:   [−15.312% −14.118% −12.898%] (p = 0.00 < 0.05)
                        thrpt:  [+14.808% +16.439% +18.081%]
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  8 (8.00%) high mild
  3 (3.00%) high severe
get_range_sequential/local_fs/100
                        time:   [2.0626 ms 2.0773 ms 2.0948 ms]
                        thrpt:  [372.95 MiB/s 376.08 MiB/s 378.77 MiB/s]
                 change:
                        time:   [−14.729% −13.884% −12.938%] (p = 0.00 < 0.05)
                        thrpt:  [+14.861% +16.123% +17.273%]
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
get_range_sequential/local_fs/1000
                        time:   [23.157 ms 23.363 ms 23.596 ms]
                        thrpt:  [331.09 MiB/s 334.39 MiB/s 337.37 MiB/s]
                 change:
                        time:   [−1.0249% +0.0131% +1.2791%] (p = 0.98 > 0.05)
                        thrpt:  [−1.2630% −0.0131% +1.0356%]
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) high mild
  5 (5.00%) high severe

Code for the benchmark, mostly doing 8 KB reads and some full-file reads (64 MB); happy to add more cases that might be interesting:

use criterion::{BenchmarkId, Criterion, Throughput, criterion_group, criterion_main};
use object_store::local::LocalFileSystem;
use object_store::path::Path;
use object_store::{ObjectStore, ObjectStoreExt};
use rand::Rng;
use std::ops::Range;
use tempfile::TempDir;

const FILE_SIZE: u64 = 64 * 1024 * 1024; // 64 MB
const RANGE_SIZE: u64 = 8 * 1024; // 8 KB ranges

fn generate_random_ranges(file_size: u64, range_size: u64, count: usize) -> Vec<Range<u64>> {
    let mut rng = rand::rng();
    (0..count)
        .map(|_| {
            let start = rng.random_range(0..file_size - range_size);
            start..start + range_size
        })
        .collect()
}

fn bench_read_ranges(c: &mut Criterion) {
    let rt = tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .unwrap();

    // Set up the test file
    let temp_dir = TempDir::new().unwrap();
    let store = LocalFileSystem::new_with_prefix(temp_dir.path()).unwrap();
    let path = Path::from("bench_file");

    // Create a file with deterministic patterned data (i % 256)
    let data: Vec<u8> = (0..FILE_SIZE).map(|i| (i % 256) as u8).collect();
    rt.block_on(async {
        store.put(&path, data.into()).await.unwrap();
    });

    let mut group = c.benchmark_group("read_ranges");

    for num_ranges in [10, 100, 1000] {
        let ranges = generate_random_ranges(FILE_SIZE, RANGE_SIZE, num_ranges);
        let total_bytes = num_ranges as u64 * RANGE_SIZE;

        group.throughput(Throughput::Bytes(total_bytes));
        group.bench_with_input(
            BenchmarkId::new("local_fs", num_ranges),
            &ranges,
            |b, ranges| {
                b.to_async(&rt)
                    .iter(|| async { store.get_ranges(&path, ranges).await.unwrap() });
            },
        );
    }

    group.finish();
}

fn bench_get_opts_whole_file(c: &mut Criterion) {
    let rt = tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .unwrap();

    let temp_dir = TempDir::new().unwrap();
    let store = LocalFileSystem::new_with_prefix(temp_dir.path()).unwrap();
    let path = Path::from("bench_file");

    let data: Vec<u8> = (0..FILE_SIZE).map(|i| (i % 256) as u8).collect();
    rt.block_on(async {
        store.put(&path, data.into()).await.unwrap();
    });

    let mut group = c.benchmark_group("get_opts_whole_file");
    group.throughput(Throughput::Bytes(FILE_SIZE));

    group.bench_function("local_fs", |b| {
        b.to_async(&rt).iter(|| async {
            store
                .get_opts(&path, Default::default())
                .await
                .unwrap()
                .bytes()
                .await
                .unwrap()
        });
    });

    group.finish();
}

fn bench_get_range_sequential(c: &mut Criterion) {
    let rt = tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .unwrap();

    let temp_dir = TempDir::new().unwrap();
    let store = LocalFileSystem::new_with_prefix(temp_dir.path()).unwrap();
    let path = Path::from("bench_file");

    let data: Vec<u8> = (0..FILE_SIZE).map(|i| (i % 256) as u8).collect();
    rt.block_on(async {
        store.put(&path, data.into()).await.unwrap();
    });

    let mut group = c.benchmark_group("get_range_sequential");

    for num_ranges in [10, 100, 1000] {
        let ranges = generate_random_ranges(FILE_SIZE, RANGE_SIZE, num_ranges);
        let total_bytes = num_ranges as u64 * RANGE_SIZE;

        group.throughput(Throughput::Bytes(total_bytes));
        group.bench_with_input(
            BenchmarkId::new("local_fs", num_ranges),
            &ranges,
            |b, ranges| {
                b.to_async(&rt).iter(|| async {
                    for range in ranges {
                        store.get_range(&path, range.clone()).await.unwrap();
                    }
                });
            },
        );
    }

    group.finish();
}

criterion_group!(
    benches,
    bench_read_ranges,
    bench_get_opts_whole_file,
    bench_get_range_sequential
);
criterion_main!(benches);

.map_err(|source| local::Error::UnableToReadBytes { source, path })?;

Ok(buffer.into())

use crate::local::read_range;
@AdamGS (Contributor Author) commented:

this makes me think that maybe read_range belongs elsewhere, maybe a new file module?

let mut buf_slice = &mut buf[..];
let mut offset = range.start;

while !buf_slice.is_empty() {
@AdamGS (Contributor Author) commented:

This is roughly the implementation of read_exact_at; it just allows us to keep the same errors and to track how many bytes we've read for the OutOfRange error branch.
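The loop described here can be sketched as follows (a Unix-only sketch; `read_full_at` is a made-up name standing in for the PR's inlined loop, and the real code maps errors to the crate's own types):

```rust
use std::fs::File;
use std::io::{self, ErrorKind};
use std::os::unix::fs::FileExt; // Unix-only for brevity; Windows would use seek_read

/// Fill `buf` starting at `offset`, looping over short reads.
/// Returns how many bytes were actually read (may be < buf.len() at EOF),
/// which mirrors the need to report progress for an OutOfRange-style error.
fn read_full_at(file: &File, buf: &mut [u8], offset: u64) -> io::Result<usize> {
    let mut total = 0;
    while total < buf.len() {
        match file.read_at(&mut buf[total..], offset + total as u64) {
            Ok(0) => break, // EOF: stop and report how far we got
            Ok(n) => total += n,
            Err(e) if e.kind() == ErrorKind::Interrupted => continue, // retry on EINTR
            Err(e) => return Err(e),
        }
    }
    Ok(total)
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("read_full_at_demo.bin");
    std::fs::write(&path, b"abcdefghij")?;
    let file = File::open(&path)?;
    let mut buf = vec![0u8; 5];
    let n = read_full_at(&file, &mut buf, 2)?;
    println!("{} {}", n, std::str::from_utf8(&buf[..n]).unwrap());
    std::fs::remove_file(&path)?;
    Ok(())
}
```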

src/local.rs Outdated
Comment on lines 1000 to 1006
// Safety:
// Setting the buffer's length to its capacity is safe as it remains within its allocated memory.
// In cases where `read_exact_at` errors, the contents of the buffer are undefined,
// but we discard it without using it.
unsafe {
buf.set_len(to_read as usize);
}
@AdamGS (Contributor Author) commented:

This is the first use of unsafe in the crate, which I'm not thrilled about. The alternative seems to be to pre-fill it with Vec::resize, but that seems wasteful.

A reviewer commented:

I think the later &mut buf[..] makes this UB.
What is the performance when initializing buf with vec![0u8; 8]?
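A sketch of the safe, zero-initialized alternative being asked about (Unix-only; `read_range_zeroed` is an illustrative name, not the PR's code):

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;

// Safe alternative to `unsafe { buf.set_len(...) }`: zero-fill the buffer
// first, so handing out `&mut buf[..]` is sound. Costs one memset per range.
fn read_range_zeroed(file: &File, start: u64, to_read: usize) -> io::Result<Vec<u8>> {
    let mut buf = vec![0u8; to_read]; // fully initialized up front
    file.read_exact_at(&mut buf, start)?;
    Ok(buf)
}

fn main() -> io::Result<()> {
    let path = std::env::temp_dir().join("read_range_zeroed_demo.bin");
    std::fs::write(&path, b"0123456789")?;
    let file = File::open(&path)?;
    let buf = read_range_zeroed(&file, 3, 4)?;
    println!("{}", std::str::from_utf8(&buf).unwrap());
    std::fs::remove_file(&path)?;
    Ok(())
}
```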

@Dandandan commented Jan 29, 2026:

The alternative would be reimplementing/vendoring part of this API https://doc.rust-lang.org/std/os/unix/fs/trait.FileExt.html#method.read_buf_at

@Dandandan commented Jan 29, 2026:

Or a suggestion by Gemini (raw file IO):

use std::os::unix::io::AsRawFd;

let fd = file.as_raw_fd();
let mut buf: Vec<u8> = Vec::with_capacity(to_read);

let bytes_read = unsafe {
    // libc::pread takes a raw pointer (*mut c_void), avoiding the reference-creation UB
    libc::pread(fd, buf.as_mut_ptr() as *mut _, to_read, offset as i64)
};

if bytes_read >= 0 {
    unsafe { buf.set_len(bytes_read as usize) };
} else {
    // Handle error
}


@AdamGS (Contributor Author) commented:

I'll try it out with pre-allocated data

@AdamGS (Contributor Author) commented:

Just preallocated the buffer with zeros (buf.resize(to_read as usize, 0_u8)); results:

read_ranges/local_fs/10 time:   [24.271 µs 24.605 µs 24.979 µs]
                        thrpt:  [3.0543 GiB/s 3.1008 GiB/s 3.1434 GiB/s]
                 change:
                        time:   [−9.2694% −7.9576% −6.6172%] (p = 0.00 < 0.05)
                        thrpt:  [+7.0861% +8.6456% +10.216%]
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  5 (5.00%) high mild
  4 (4.00%) high severe
read_ranges/local_fs/100
                        time:   [91.591 µs 92.250 µs 92.981 µs]
                        thrpt:  [8.2054 GiB/s 8.2703 GiB/s 8.3299 GiB/s]
                 change:
                        time:   [−19.988% −19.050% −18.121%] (p = 0.00 < 0.05)
                        thrpt:  [+22.131% +23.532% +24.982%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  5 (5.00%) high severe
read_ranges/local_fs/1000
                        time:   [877.00 µs 891.02 µs 907.75 µs]
                        thrpt:  [8.4047 GiB/s 8.5626 GiB/s 8.6994 GiB/s]
                 change:
                        time:   [−32.377% −30.518% −28.468%] (p = 0.00 < 0.05)
                        thrpt:  [+39.797% +43.921% +47.879%]
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe

get_opts_whole_file/local_fs
                        time:   [8.2127 ms 8.2487 ms 8.2891 ms]
                        thrpt:  [7.5400 GiB/s 7.5770 GiB/s 7.6101 GiB/s]
                 change:
                        time:   [−3.7223% −2.3673% −1.0502%] (p = 0.00 < 0.05)
                        thrpt:  [+1.0613% +2.4247% +3.8663%]
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  3 (3.00%) high mild
  9 (9.00%) high severe

get_range_sequential/local_fs/10
                        time:   [223.74 µs 225.81 µs 228.25 µs]
                        thrpt:  [342.28 MiB/s 345.97 MiB/s 349.18 MiB/s]
                 change:
                        time:   [−8.4366% −7.0562% −5.8343%] (p = 0.00 < 0.05)
                        thrpt:  [+6.1958% +7.5919% +9.2139%]
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
get_range_sequential/local_fs/100
                        time:   [2.2283 ms 2.2480 ms 2.2706 ms]
                        thrpt:  [344.07 MiB/s 347.54 MiB/s 350.61 MiB/s]
                 change:
                        time:   [−8.9263% −7.8699% −6.7098%] (p = 0.00 < 0.05)
                        thrpt:  [+7.1924% +8.5421% +9.8012%]
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  6 (6.00%) high mild
  3 (3.00%) high severe
get_range_sequential/local_fs/1000
                        time:   [23.699 ms 23.843 ms 24.011 ms]
                        thrpt:  [325.37 MiB/s 327.66 MiB/s 329.65 MiB/s]
                 change:
                        time:   [+1.8221% +2.7832% +3.8109%] (p = 0.00 < 0.05)
                        thrpt:  [−3.6710% −2.7079% −1.7895%]
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  2 (2.00%) low mild
  6 (6.00%) high severe

@AdamGS (Contributor Author) commented:

Maybe a bit worse for get_opts but still an improvement; otherwise very similar, within the noise of a laptop.

@AdamGS AdamGS changed the title [WIP] Use platform specific read_at when available Use platform specific read_at when available Jan 29, 2026
@AdamGS AdamGS marked this pull request as ready for review January 29, 2026 19:22
Before:

// extents of the file
if range.start >= file_len {

After:

// extents of the file, or if it's at the end of the file and wants to read a non-empty range.
if range.start > file_len || (range.start == file_len && !range.is_empty()) {

A reviewer commented:

Isn't the added condition equivalent to range.end > file_len?

@AdamGS (Contributor Author) commented:

From lines 994-995:

// Don't read past end of file
let to_read = range.end.min(file_len) - range.start;

I think it's actually technically OK to read with a range that ends past the end of the file (I think HTTP Range headers allow that too); the only issue is trying to start at the end and read more. I'll try to figure out whether this error is actually tested, and add a test if not.
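Putting the clamping from lines 994-995 together with the new bounds check, the behaviour under discussion can be sketched as (illustrative; `clamp_range` is a made-up helper, not the PR's code):

```rust
use std::ops::Range;

// A range may *end* past EOF (it gets clamped, like an HTTP Range request),
// but it may not *start* past EOF, and starting exactly at EOF is only
// valid for an empty range. On success, returns the number of bytes to read.
fn clamp_range(range: &Range<u64>, file_len: u64) -> Result<u64, String> {
    if range.start > file_len || (range.start == file_len && !range.is_empty()) {
        return Err(format!(
            "range {:?} out of bounds for file of length {}",
            range, file_len
        ));
    }
    // Don't read past end of file
    Ok(range.end.min(file_len) - range.start)
}

fn main() {
    assert_eq!(clamp_range(&(0..10), 100).unwrap(), 10);
    assert_eq!(clamp_range(&(90..200), 100).unwrap(), 10); // end clamped to EOF
    assert_eq!(clamp_range(&(100..100), 100).unwrap(), 0); // empty range at EOF is ok
    assert!(clamp_range(&(100..108), 100).is_err()); // non-empty read starting at EOF
    println!("ok");
}
```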

A reviewer replied:

Ok


AdamGS commented Jan 29, 2026

Seems like I need to do some more work on the Windows side.



Development

Successfully merging this pull request may close these issues.

LocalFileSystem: use read_at instead of seek + read
