perf: benchmarking larger chunks + non-repeated values#205

Merged
d-v-b merged 1 commit into d-v-b:perf/prepared-write-v2 from ilan-gold:ig/benchmarking on Jun 24, 2026

@ilan-gold ilan-gold commented Jun 24, 2026

This PR extends the benchmarks beyond repeated values, which compress comically well and therefore make the benchmark somewhat unrealistic. It also increases the chunk size, because the uncompressed chunks were smaller than what I think most people use.
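To illustrate why repeated values flatter a compression benchmark, here is a minimal sketch comparing a constant buffer with a low-entropy random one. Everything in it is an assumption made for illustration: the generator is hypothetical (it is not the benchmark's `semi_random` fixture), and `zlib` from the standard library stands in for zstd.

```python
import random
import zlib

N_BYTES = 200_000

# "repeated": a single value over and over, the best case for any compressor.
repeated = bytes([7]) * N_BYTES

# Hypothetical "semi-random" data: each byte is drawn uniformly from only
# 16 values, i.e. about 4 bits of entropy per 8-bit byte, so an ideal
# coder would approach a 2x ratio.
rng = random.Random(0)
semi_random = bytes(rng.randrange(16) for _ in range(N_BYTES))


def ratio(data: bytes) -> float:
    """Uncompressed size divided by compressed size (zlib, not zstd)."""
    return len(data) / len(zlib.compress(data))


repeated_ratio = ratio(repeated)
semi_random_ratio = ratio(semi_random)
print(f"repeated: {repeated_ratio:.0f}x, semi_random: {semi_random_ratio:.1f}x")
```

The constant buffer shrinks by orders of magnitude, so a benchmark built on it mostly measures per-chunk overhead rather than compression or I/O volume.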

Here's the full results table on my Linux virtual machine from DENBI:

                                                                                       Benchmark Results                                                                                        
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━┓
┃                                                                                                                                      Benchmark ┃ Time (best) ┃ Rel. StdDev ┃ Run time ┃ Iters ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━┩
│                              test_write_array[batched-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=None)-zstd-repeated] │       1.25s │        1.3% │    5.31s │     3 │
│                           test_write_array[batched-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=None)-zstd-semi_random] │       1.70s │        2.6% │    7.09s │     3 │
│                         test_write_array[batched-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=(100000,))-zstd-repeated] │       1.39s │        2.8% │    4.75s │     3 │
│                      test_write_array[batched-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=(100000,))-zstd-semi_random] │       1.75s │        9.4% │    6.27s │     3 │
│                      test_write_array[batched-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=(100000000,))-zstd-repeated] │     573.9ms │        4.4% │    2.20s │     3 │
│                   test_write_array[batched-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=(100000000,))-zstd-semi_random] │       1.11s │        5.2% │    4.09s │     3 │
│                      test_write_array[batched-latency=0-local-Layout(shape=(1000000000,), chunks=(100000,), shards=(10000000,))-zstd-repeated] │       5.16s │        4.0% │   16.64s │     3 │
│                   test_write_array[batched-latency=0-local-Layout(shape=(1000000000,), chunks=(100000,), shards=(10000000,))-zstd-semi_random] │       8.03s │        9.0% │   36.01s │     3 │
│                          test_write_array[batched-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=None)-zstd-repeated] │       3.55s │        3.8% │   11.57s │     3 │
│                       test_write_array[batched-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=None)-zstd-semi_random] │       3.78s │        2.7% │   12.07s │     3 │
│                     test_write_array[batched-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=(100000,))-zstd-repeated] │       3.75s │        0.5% │   11.79s │     3 │
│                  test_write_array[batched-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=(100000,))-zstd-semi_random] │       3.96s │        4.5% │   12.75s │     3 │
│                  test_write_array[batched-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=(100000000,))-zstd-repeated] │     620.5ms │        0.4% │    2.32s │     3 │
│               test_write_array[batched-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=(100000000,))-zstd-semi_random] │     762.6ms │       19.0% │    3.25s │     3 │
│                  test_write_array[batched-latency=0.03-memory-Layout(shape=(1000000000,), chunks=(100000,), shards=(10000000,))-zstd-repeated] │       5.36s │        1.1% │   16.60s │     3 │
│               test_write_array[batched-latency=0.03-memory-Layout(shape=(1000000000,), chunks=(100000,), shards=(10000000,))-zstd-semi_random] │       9.20s │        2.5% │   28.62s │     3 │
│                  test_write_array[fused_full_threaded-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=None)-zstd-repeated] │     975.7ms │        7.2% │    3.57s │     3 │
│               test_write_array[fused_full_threaded-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=None)-zstd-semi_random] │       1.11s │        1.0% │    3.86s │     3 │
│             test_write_array[fused_full_threaded-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=(100000,))-zstd-repeated] │       1.24s │       11.3% │    4.80s │     3 │
│          test_write_array[fused_full_threaded-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=(100000,))-zstd-semi_random] │       1.49s │        3.1% │    5.31s │     3 │
│          test_write_array[fused_full_threaded-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=(100000000,))-zstd-repeated] │      19.4ms │       10.1% │    0.54s │     3 │
│       test_write_array[fused_full_threaded-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=(100000000,))-zstd-semi_random] │       1.18s │        1.4% │    4.07s │     3 │
│          test_write_array[fused_full_threaded-latency=0-local-Layout(shape=(1000000000,), chunks=(100000,), shards=(10000000,))-zstd-repeated] │     330.9ms │        5.7% │    1.43s │     3 │
│       test_write_array[fused_full_threaded-latency=0-local-Layout(shape=(1000000000,), chunks=(100000,), shards=(10000000,))-zstd-semi_random] │       4.42s │       10.1% │   16.41s │     3 │
│              test_write_array[fused_full_threaded-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=None)-zstd-repeated] │       3.69s │        2.6% │   12.01s │     3 │
│           test_write_array[fused_full_threaded-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=None)-zstd-semi_random] │       3.62s │        2.3% │   11.60s │     3 │
│         test_write_array[fused_full_threaded-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=(100000,))-zstd-repeated] │       3.84s │        1.1% │   12.16s │     3 │
│      test_write_array[fused_full_threaded-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=(100000,))-zstd-semi_random] │       3.85s │        1.3% │   12.16s │     3 │
│      test_write_array[fused_full_threaded-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=(100000000,))-zstd-repeated] │     266.6ms │        7.9% │    1.30s │     3 │
│   test_write_array[fused_full_threaded-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=(100000000,))-zstd-semi_random] │       1.30s │        3.1% │    4.37s │     3 │
│      test_write_array[fused_full_threaded-latency=0.03-memory-Layout(shape=(1000000000,), chunks=(100000,), shards=(10000000,))-zstd-repeated] │       5.37s │        2.2% │   16.99s │     3 │
│   test_write_array[fused_full_threaded-latency=0.03-memory-Layout(shape=(1000000000,), chunks=(100000,), shards=(10000000,))-zstd-semi_random] │       5.68s │        1.5% │   17.69s │     3 │
│                test_write_array[fused_single_threaded-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=None)-zstd-repeated] │     285.9ms │        8.3% │    1.39s │     3 │
│             test_write_array[fused_single_threaded-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=None)-zstd-semi_random] │       1.38s │        2.7% │    4.80s │     3 │
│           test_write_array[fused_single_threaded-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=(100000,))-zstd-repeated] │     414.2ms │        8.7% │    2.76s │     3 │
│        test_write_array[fused_single_threaded-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=(100000,))-zstd-semi_random] │       1.50s │        2.8% │    5.18s │     3 │
│        test_write_array[fused_single_threaded-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=(100000000,))-zstd-repeated] │      20.4ms │       11.5% │    0.47s │     3 │
│     test_write_array[fused_single_threaded-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=(100000000,))-zstd-semi_random] │       1.20s │        1.3% │    4.09s │     3 │
│        test_write_array[fused_single_threaded-latency=0-local-Layout(shape=(1000000000,), chunks=(100000,), shards=(10000000,))-zstd-repeated] │     136.0ms │        3.0% │    0.82s │     3 │
│     test_write_array[fused_single_threaded-latency=0-local-Layout(shape=(1000000000,), chunks=(100000,), shards=(10000000,))-zstd-semi_random] │      11.70s │        0.1% │   36.40s │     3 │
│            test_write_array[fused_single_threaded-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=None)-zstd-repeated] │       3.62s │        4.3% │   11.96s │     3 │
│         test_write_array[fused_single_threaded-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=None)-zstd-semi_random] │       4.59s │        2.4% │   14.55s │     3 │
│       test_write_array[fused_single_threaded-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=(100000,))-zstd-repeated] │       3.84s │        1.5% │   12.18s │     3 │
│    test_write_array[fused_single_threaded-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=(100000,))-zstd-semi_random] │       4.74s │        2.1% │   14.84s │     3 │
│    test_write_array[fused_single_threaded-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=(100000000,))-zstd-repeated] │     240.8ms │        2.5% │    1.40s │     3 │
│ test_write_array[fused_single_threaded-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=(100000000,))-zstd-semi_random] │       1.24s │        0.6% │    4.16s │     3 │
│    test_write_array[fused_single_threaded-latency=0.03-memory-Layout(shape=(1000000000,), chunks=(100000,), shards=(10000000,))-zstd-repeated] │       2.58s │        3.0% │    8.31s │     3 │
│ test_write_array[fused_single_threaded-latency=0.03-memory-Layout(shape=(1000000000,), chunks=(100000,), shards=(10000000,))-zstd-semi_random] │      12.32s │        0.8% │   37.83s │     3 │
│                               test_read_array[batched-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=None)-zstd-repeated] │     873.8ms │        3.7% │    3.17s │     3 │
│                            test_read_array[batched-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=None)-zstd-semi_random] │     984.5ms │        5.5% │    3.60s │     3 │
│                          test_read_array[batched-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=(100000,))-zstd-repeated] │       1.15s │        3.6% │    3.99s │     3 │
│                       test_read_array[batched-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=(100000,))-zstd-semi_random] │       1.25s │        8.9% │    4.44s │     3 │
│                       test_read_array[batched-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=(100000000,))-zstd-repeated] │     418.3ms │       14.1% │    1.79s │     3 │
│                    test_read_array[batched-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=(100000000,))-zstd-semi_random] │     554.6ms │        5.5% │    2.14s │     3 │
│                       test_read_array[batched-latency=0-local-Layout(shape=(1000000000,), chunks=(100000,), shards=(10000000,))-zstd-repeated] │       4.06s │       11.1% │   13.47s │     3 │
│                    test_read_array[batched-latency=0-local-Layout(shape=(1000000000,), chunks=(100000,), shards=(10000000,))-zstd-semi_random] │       5.39s │        5.2% │   18.10s │     3 │
│                           test_read_array[batched-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=None)-zstd-repeated] │       3.60s │        3.5% │   11.67s │     3 │
│                        test_read_array[batched-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=None)-zstd-semi_random] │       3.64s │        1.3% │   12.01s │     3 │
│                      test_read_array[batched-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=(100000,))-zstd-repeated] │       3.80s │        0.8% │   12.14s │     3 │
│                   test_read_array[batched-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=(100000,))-zstd-semi_random] │       3.72s │        1.8% │   11.77s │     3 │
│                   test_read_array[batched-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=(100000000,))-zstd-repeated] │     451.8ms │       11.8% │    1.90s │     3 │
│                test_read_array[batched-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=(100000000,))-zstd-semi_random] │     518.6ms │        5.0% │    1.99s │     3 │
│                   test_read_array[batched-latency=0.03-memory-Layout(shape=(1000000000,), chunks=(100000,), shards=(10000000,))-zstd-repeated] │       3.83s │        8.2% │   12.89s │     3 │
│                test_read_array[batched-latency=0.03-memory-Layout(shape=(1000000000,), chunks=(100000,), shards=(10000000,))-zstd-semi_random] │       4.77s │        9.0% │   15.95s │     3 │
│                   test_read_array[fused_full_threaded-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=None)-zstd-repeated] │     482.9ms │        5.5% │    2.11s │     3 │
│                test_read_array[fused_full_threaded-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=None)-zstd-semi_random] │     546.4ms │        0.2% │    2.31s │     3 │
│              test_read_array[fused_full_threaded-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=(100000,))-zstd-repeated] │     966.4ms │        1.5% │    3.66s │     3 │
│           test_read_array[fused_full_threaded-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=(100000,))-zstd-semi_random] │       1.02s │        0.6% │    3.50s │     3 │
│           test_read_array[fused_full_threaded-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=(100000000,))-zstd-repeated] │     111.4ms │       16.9% │    0.70s │     3 │
│        test_read_array[fused_full_threaded-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=(100000000,))-zstd-semi_random] │     268.1ms │        4.6% │    1.51s │     3 │
│           test_read_array[fused_full_threaded-latency=0-local-Layout(shape=(1000000000,), chunks=(100000,), shards=(10000000,))-zstd-repeated] │       1.97s │        1.8% │    6.43s │     3 │
│        test_read_array[fused_full_threaded-latency=0-local-Layout(shape=(1000000000,), chunks=(100000,), shards=(10000000,))-zstd-semi_random] │       1.75s │       10.7% │    7.10s │     3 │
│               test_read_array[fused_full_threaded-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=None)-zstd-repeated] │       3.29s │        5.7% │   10.99s │     3 │
│            test_read_array[fused_full_threaded-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=None)-zstd-semi_random] │       3.61s │        1.6% │   12.33s │     3 │
│          test_read_array[fused_full_threaded-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=(100000,))-zstd-repeated] │       3.59s │        2.3% │   11.34s │     3 │
│       test_read_array[fused_full_threaded-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=(100000,))-zstd-semi_random] │       3.65s │        3.2% │   11.54s │     3 │
│       test_read_array[fused_full_threaded-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=(100000000,))-zstd-repeated] │     374.8ms │        3.0% │    1.55s │     3 │
│    test_read_array[fused_full_threaded-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=(100000000,))-zstd-semi_random] │     320.0ms │        5.0% │    1.49s │     3 │
│       test_read_array[fused_full_threaded-latency=0.03-memory-Layout(shape=(1000000000,), chunks=(100000,), shards=(10000000,))-zstd-repeated] │       3.27s │        1.8% │   10.37s │     3 │
│    test_read_array[fused_full_threaded-latency=0.03-memory-Layout(shape=(1000000000,), chunks=(100000,), shards=(10000000,))-zstd-semi_random] │       3.31s │        0.7% │   10.40s │     3 │
│                 test_read_array[fused_single_threaded-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=None)-zstd-repeated] │     269.8ms │        6.1% │    1.20s │     3 │
│              test_read_array[fused_single_threaded-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=None)-zstd-semi_random] │     480.8ms │        2.4% │    1.95s │     3 │
│            test_read_array[fused_single_threaded-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=(100000,))-zstd-repeated] │     368.2ms │        3.3% │    1.58s │     3 │
│         test_read_array[fused_single_threaded-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=(100000,))-zstd-semi_random] │     562.0ms │        3.0% │    2.20s │     3 │
│         test_read_array[fused_single_threaded-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=(100000000,))-zstd-repeated] │     110.1ms │        0.8% │    0.84s │     3 │
│      test_read_array[fused_single_threaded-latency=0-local-Layout(shape=(100000000,), chunks=(100000,), shards=(100000000,))-zstd-semi_random] │     271.1ms │        2.7% │    1.33s │     3 │
│         test_read_array[fused_single_threaded-latency=0-local-Layout(shape=(1000000000,), chunks=(100000,), shards=(10000000,))-zstd-repeated] │     900.8ms │        0.4% │    3.07s │     3 │
│      test_read_array[fused_single_threaded-latency=0-local-Layout(shape=(1000000000,), chunks=(100000,), shards=(10000000,))-zstd-semi_random] │       2.60s │        3.6% │    9.23s │     3 │
│             test_read_array[fused_single_threaded-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=None)-zstd-repeated] │       3.46s │        2.0% │   10.96s │     3 │
│          test_read_array[fused_single_threaded-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=None)-zstd-semi_random] │       3.51s │        1.0% │   10.99s │     3 │
│        test_read_array[fused_single_threaded-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=(100000,))-zstd-repeated] │       3.50s │        2.7% │   11.09s │     3 │
│     test_read_array[fused_single_threaded-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=(100000,))-zstd-semi_random] │       3.48s │        3.1% │   11.22s │     3 │
│     test_read_array[fused_single_threaded-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=(100000000,))-zstd-repeated] │     182.4ms │        7.7% │    1.15s │     3 │
│  test_read_array[fused_single_threaded-latency=0.03-memory-Layout(shape=(100000000,), chunks=(100000,), shards=(100000000,))-zstd-semi_random] │     250.7ms │        7.9% │    1.22s │     3 │
│     test_read_array[fused_single_threaded-latency=0.03-memory-Layout(shape=(1000000000,), chunks=(100000,), shards=(10000000,))-zstd-repeated] │       1.03s │        1.0% │    3.48s │     3 │
│  test_read_array[fused_single_threaded-latency=0.03-memory-Layout(shape=(1000000000,), chunks=(100000,), shards=(10000000,))-zstd-semi_random] │       2.39s │        0.8% │    7.62s │     3 │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┴─────────────┴─────────────┴──────────┴───────┘

I will do a full write-up with some graphs (thanks, Claude) tomorrow, but I think the most realistic use cases from this table are:

-latency=0-local-Layout(shape=(1000000000,), chunks=(100000,), shards=(10000000,))-zstd-semi_random

i.e., local file system, 100 chunks in a shard, 100 shards, data that compresses moderately well (I think it was 2X) and

-latency=0.03-memory-Layout(shape=(1000000000,), chunks=(100000,), shards=(10000000,))-zstd-semi_random

the same, with latency attached.
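The chunk and shard counts quoted above follow directly from the layout in the benchmark IDs. A minimal sketch of that arithmetic, assuming a hypothetical `Layout` dataclass that only mirrors the label in the table (it is not the benchmark suite's own class):

```python
from __future__ import annotations

from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Layout:
    """Hypothetical stand-in mirroring the Layout(...) label in the benchmark IDs."""

    shape: tuple[int, ...]
    chunks: tuple[int, ...]
    shards: tuple[int, ...] | None = None

    def chunks_per_shard(self) -> int:
        # Without sharding, each chunk is stored as its own object.
        if self.shards is None:
            return 1
        return prod(s // c for s, c in zip(self.shards, self.chunks))

    def n_stored_objects(self) -> int:
        # Shards if the array is sharded, otherwise plain chunks.
        outer = self.shards if self.shards is not None else self.chunks
        # Ceiling division so partial objects at the array edge are counted.
        return prod(-(-n // o) for n, o in zip(self.shape, outer))


layout = Layout(shape=(1_000_000_000,), chunks=(100_000,), shards=(10_000_000,))
print(layout.chunks_per_shard(), layout.n_stored_objects())  # 100 100
```

The same helper gives 1000 stored objects for the unsharded `shape=(100000000,), chunks=(100000,)` layout, which is where per-request latency is paid most often.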

For the local version, everything makes sense: more threads is faster than both one thread and batched, i.e., the current main pipeline (writes: 4.42s vs. 11.70s and 8.03s; reads: 1.75s vs. 2.60s and 5.39s).

With latency, reading isn't faster with threads than without (3.31s threaded vs. 2.39s single-threaded), but writing is much faster (5.68s vs. 12.32s). I suspect this is due to using MemoryStore as the backing store, but I have no hard evidence for that.
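To put numbers on these comparisons, here is a small sketch that computes how much faster the threaded pipeline is than the other two for the two layouts singled out above. The times are the "Time (best)" values copied from the results table; the dictionary layout and the helper name are only illustrative.

```python
# Best times in seconds from the results table, for
# Layout(shape=(1000000000,), chunks=(100000,), shards=(10000000,)), zstd, semi_random.
best = {
    ("write", "local"): {"batched": 8.03, "fused_full_threaded": 4.42, "fused_single_threaded": 11.70},
    ("read", "local"): {"batched": 5.39, "fused_full_threaded": 1.75, "fused_single_threaded": 2.60},
    ("write", "latency"): {"batched": 9.20, "fused_full_threaded": 5.68, "fused_single_threaded": 12.32},
    ("read", "latency"): {"batched": 4.77, "fused_full_threaded": 3.31, "fused_single_threaded": 2.39},
}


def speedup_of_threaded(times: dict[str, float]) -> dict[str, float]:
    """Return other_time / threaded_time; values above 1 mean threaded is faster."""
    threaded = times["fused_full_threaded"]
    return {
        name: other / threaded
        for name, other in times.items()
        if name != "fused_full_threaded"
    }


speedups = {key: speedup_of_threaded(times) for key, times in best.items()}
for (op, store), ratios in speedups.items():
    print(op, store, {name: round(value, 2) for name, value in ratios.items()})
```

A ratio below 1 for `fused_single_threaded` in the read-with-latency row is the one case where threads do not help. With only 3 iterations per benchmark, small differences should be read with some caution.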

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@d-v-b d-v-b merged commit d4ed3eb into d-v-b:perf/prepared-write-v2 Jun 24, 2026
4 checks passed