bench(parquet): make arrow_writer benchmarks allocation-order stable#10068

Closed
adriangb wants to merge 1 commit into
apache:mainfrom
pydantic:bench-arrow-writer-stable-allocator

@adriangb adriangb commented Jun 4, 2026

Which issue does this PR close?

Follow-up to benchmark noise observed on the criterion bench bot (e.g. on #9972), where string/parquet_2 reported a ~1.75x "regression" that was not reproducible and not present in instruction counts.

Rationale for this change

The arrow_writer benchmarks build a fresh ArrowWriter every criterion iteration, so the writer's internal encode buffers are allocated and freed on each iteration. With a page-decaying allocator (glibc default, jemalloc default), those buffers are served from fresh, un-faulted pages whenever earlier benchmarks in the same process have churned the heap, so each iteration pays a minor page fault on the first write to every page of those buffers.

That fault tax roughly doubles the measured time for the byte-array writers and makes the result depend on benchmark order. On the same hardware as the bench bot (Neoverse-V2), the same main binary produces:

string/parquet_2                 time
----------------                 ----
run in isolation                 ~106 ms
run after the primitive group    ~187 ms

This is the source of the spurious bench-bot deltas: a main-vs-main control run (identical code on both sides) reproduced an 18% difference on string/parquet_2, and a larger draw produced the original ~1.75x. The work done is identical (instruction count differs by ~0.25% for the change that triggered the investigation) — only the page-fault state differs.

Diagnosis details: the slow basin shows ~5M minor faults vs ~763K in the fast basin; forcing every buffer onto fresh pages (MALLOC_MMAP_THRESHOLD_ low) pins it slow, and disabling page decay pins it fast.
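
The fault counts above can also be observed from inside a process. The following is a minimal sketch, assuming a Linux host where `/proc/self/stat` is readable; the helper names (`parse_minflt`, `minor_faults`, `faults_during`) are hypothetical and not part of this PR, and the PR does not say which tool produced the numbers quoted here.

```rust
use std::fs;

/// Extract `minflt` (field 10 of /proc/<pid>/stat) from one stat line.
/// The `comm` field (field 2) may contain spaces or parentheses, so we
/// split after the last ')' instead of counting tokens from the start.
fn parse_minflt(stat_line: &str) -> Option<u64> {
    let after_comm = &stat_line[stat_line.rfind(')')? + 1..];
    // The slice starts at field 3 (state); minflt is field 10, i.e. index 7.
    after_comm.split_whitespace().nth(7)?.parse().ok()
}

/// Minor faults taken so far by the current process (Linux only).
fn minor_faults() -> Option<u64> {
    let stat = fs::read_to_string("/proc/self/stat").ok()?;
    parse_minflt(&stat)
}

/// Run `work` and return how many minor faults were taken while it ran.
/// Returns None where /proc/self/stat is unavailable.
fn faults_during<F: FnOnce()>(work: F) -> Option<u64> {
    let before = minor_faults()?;
    work();
    Some(minor_faults()?.saturating_sub(before))
}
```

Wrapping one benchmark iteration in `faults_during` and comparing the count when the benchmark runs in isolation against the count when it runs after another group would show whether the two timing basins differ in fault count, as described above.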

What changes are included in this PR?

Use jemalloc as the arrow_writer bench's global allocator with page decay disabled (dirty_decay_ms:-1,muzzy_decay_ms:-1), so freed pages stay mapped and are reused warm instead of being returned to the OS. This removes the per-iteration fault tax and collapses the order-dependent bimodality:

string/parquet_2         isolated    after primitive    after string group
----------------         --------    ---------------    ------------------
before (system alloc)    106 ms      187 ms             106 ms
after (this PR)          ~106 ms     ~107 ms            ~106 ms
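
One way to put a number on the order dependence in this table is the ratio of the slowest to the fastest timing for the same benchmark. This is an illustrative sketch only; `order_sensitivity` is a hypothetical helper, not something the PR or criterion provides.

```rust
/// Ratio of the slowest to the fastest timing of one benchmark measured
/// under different run orders. A value of 1.0 means order made no difference.
/// Returns None for empty input or non-positive timings.
fn order_sensitivity(timings_ms: &[f64]) -> Option<f64> {
    if timings_ms.is_empty() {
        return None;
    }
    let min = timings_ms.iter().copied().fold(f64::INFINITY, f64::min);
    let max = timings_ms.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    if min <= 0.0 {
        return None;
    }
    Some(max / min)
}
```

With the values from the table, `order_sensitivity(&[106.0, 187.0, 106.0])` is about 1.76 for the system allocator, consistent with the ~1.75x reported by the bench bot, while `order_sensitivity(&[106.0, 107.0, 106.0])` is about 1.01 with this PR.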

Notes on robustness (this came up in review):

  • The decay policy is pinned by the benchmark, not left to an allocator default — via a compiled-in malloc_conf symbol — so it does not silently change if the allocator updates its defaults.
  • jemalloc only reads the unprefixed malloc_conf symbol when built with unprefixed_malloc_on_supported_platforms; without it the symbol is silently ignored. To make that failure mode loud, assert_page_decay_disabled() reads opt.dirty_decay_ms / opt.muzzy_decay_ms at startup (via tikv-jemalloc-ctl) and panics if the policy is not actually -1, with a hint. This was verified to fire when the feature is removed.
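
The two points above could look roughly like the following. This is a sketch under assumptions, not the code from this PR: it assumes a Linux target, `tikv-jemallocator` built with the `unprefixed_malloc_on_supported_platforms` feature, and `tikv-jemalloc-ctl` as a dev-dependency. Reading the options through `tikv_jemalloc_ctl::raw::read` is my assumption about how such a check can be written; the PR only states that the check goes through `tikv-jemalloc-ctl`.

```rust
use tikv_jemallocator::Jemalloc;

// Route every allocation in this benchmark binary through jemalloc.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

/// jemalloc reads this NUL-terminated string at startup. A decay time of -1
/// disables purging, so freed pages stay mapped and are reused warm.
/// `#[unsafe(...)]` is required in edition 2024 and accepted by Rust 1.82+.
#[allow(non_upper_case_globals)]
#[unsafe(export_name = "malloc_conf")]
pub static malloc_conf: &[u8] = b"dirty_decay_ms:-1,muzzy_decay_ms:-1\0";

/// Panic unless jemalloc actually picked up the decay settings above.
fn assert_page_decay_disabled() {
    for key in [&b"opt.dirty_decay_ms\0"[..], &b"opt.muzzy_decay_ms\0"[..]] {
        // SAFETY: `key` is NUL-terminated, and both options are `ssize_t`
        // in jemalloc, which is read here as `isize`.
        let value: isize = unsafe { tikv_jemalloc_ctl::raw::read(key) }
            .expect("failed to read jemalloc option");
        assert_eq!(
            value, -1,
            "page decay is still enabled; is `malloc_conf` being ignored \
             (missing `unprefixed_malloc_on_supported_platforms` feature)?"
        );
    }
}
```

Calling `assert_page_decay_disabled()` once before any benchmark group runs turns a silently ignored configuration into an immediate failure.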

Scope: the allocator only affects the arrow_writer benchmark binary; no library code changes.

Are there any user-facing changes?

No. Benchmark-only change (dev-dependencies + the arrow_writer bench).

@github-actions github-actions Bot added the parquet Changes to the parquet crate label Jun 4, 2026
@adriangb
Contributor Author

adriangb commented Jun 4, 2026

run benchmark arrow_writer

@adriangb adriangb marked this pull request as ready for review June 4, 2026 12:41
@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4622231901-434-85tkc 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing bench-arrow-writer-stable-allocator (e2c785a) to fd1c5b3 (merge-base) diff
BENCH_NAME=arrow_writer
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench arrow_writer
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner

The `arrow_writer` benchmarks rebuild a fresh `ArrowWriter` every criterion
iteration, so the writer's internal encode buffers are allocated and freed
each iteration. With a page-decaying allocator those buffers are served from
fresh (un-faulted) pages whenever earlier benchmarks in the same process have
churned the heap, so every iteration pays a per-page minor fault. That roughly
doubles the measured time for the byte-array writers and makes the result
depend on benchmark order: e.g. `string/parquet_2` swings between ~106ms and
~190ms with no code change, purely on what ran before it. This is the source
of the spurious ~1.2-1.75x deltas seen on the criterion bench bot (a main-vs-
main run reproduced an 18% delta).

Use jemalloc as the bench's global allocator with page decay disabled
(`dirty_decay_ms:-1,muzzy_decay_ms:-1`), so freed pages stay mapped and are
reused warm instead of being returned to the OS. This removes the per-iteration
fault tax and collapses the bimodality (stable ~106ms across orderings here).

The decay policy is pinned via a compiled-in `malloc_conf` symbol rather than
an allocator default, and `assert_page_decay_disabled` checks at startup (via
`tikv-jemalloc-ctl`) that it actually took effect, so a silently-ignored config
fails loudly instead of quietly reintroducing the instability.

Gated to Linux: jemalloc does not build on some targets (e.g. wasm, msvc) and
its unprefixed `malloc_conf` symbol is not honored on others (e.g. apple,
android); elsewhere the bench falls back to the default allocator. Linux is
where the canonical benchmark runner runs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@adriangbot

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                                              bench-arrow-writer-stable-allocator    main
-----                                              -----------------------------------    ----
bool/bloom_filter                                  1.00     13.1±0.06ms    19.1 MB/sec    1.00     13.1±0.04ms    19.1 MB/sec
bool/cdc                                           1.00     15.7±0.05ms    15.9 MB/sec    1.00     15.7±0.06ms    15.9 MB/sec
bool/default                                       1.00     10.9±0.04ms    22.9 MB/sec    1.01     11.0±0.04ms    22.7 MB/sec
bool/parquet_2                                     1.00     14.7±0.04ms    17.0 MB/sec    1.00     14.7±0.04ms    17.0 MB/sec
bool/zstd                                          1.00     11.4±0.04ms    21.9 MB/sec    1.00     11.5±0.04ms    21.8 MB/sec
bool/zstd_parquet_2                                1.00     15.0±0.04ms    16.6 MB/sec    1.00     15.1±0.04ms    16.6 MB/sec
bool_non_null/bloom_filter                         1.00      7.1±0.03ms    17.7 MB/sec    1.00      7.1±0.02ms    17.6 MB/sec
bool_non_null/cdc                                  1.00      6.8±0.03ms    18.3 MB/sec    1.01      6.9±0.02ms    18.2 MB/sec
bool_non_null/default                              1.00      4.3±0.02ms    28.9 MB/sec    1.00      4.3±0.02ms    28.8 MB/sec
bool_non_null/parquet_2                            1.00      9.1±0.03ms    13.7 MB/sec    1.00      9.1±0.04ms    13.7 MB/sec
bool_non_null/zstd                                 1.00      4.7±0.03ms    26.8 MB/sec    1.01      4.7±0.02ms    26.7 MB/sec
bool_non_null/zstd_parquet_2                       1.00      9.5±0.04ms    13.1 MB/sec    1.00      9.5±0.04ms    13.1 MB/sec
float_with_nans/bloom_filter                       1.00     88.5±0.38ms   158.2 MB/sec    1.05     93.0±0.36ms   150.6 MB/sec
float_with_nans/cdc                                1.00     77.6±0.26ms   180.4 MB/sec    1.05     81.8±0.25ms   171.1 MB/sec
float_with_nans/default                            1.00     70.5±0.33ms   198.7 MB/sec    1.06     74.4±0.24ms   188.1 MB/sec
float_with_nans/parquet_2                          1.00     88.1±0.44ms   159.0 MB/sec    1.08     94.7±0.37ms   147.9 MB/sec
float_with_nans/zstd                               1.00    108.7±0.31ms   128.7 MB/sec    1.03    112.0±0.22ms   125.0 MB/sec
float_with_nans/zstd_parquet_2                     1.00    125.0±0.46ms   112.0 MB/sec    1.05    131.8±0.39ms   106.2 MB/sec
list_primitive/bloom_filter                        1.00    322.2±0.93ms  1692.5 MB/sec    1.03    332.8±1.60ms  1638.5 MB/sec
list_primitive/cdc                                 1.00    359.7±0.85ms  1516.0 MB/sec    1.01    364.4±1.48ms  1496.5 MB/sec
list_primitive/default                             1.00    247.5±0.65ms     2.2 GB/sec    1.03    254.3±1.82ms     2.1 GB/sec
list_primitive/parquet_2                           1.01   277.9±36.54ms  1962.8 MB/sec    1.00    275.7±0.74ms  1978.4 MB/sec
list_primitive/zstd                                1.00    497.1±2.02ms  1097.1 MB/sec    1.02    505.5±2.48ms  1078.8 MB/sec
list_primitive/zstd_parquet_2                      1.00    498.2±0.70ms  1094.7 MB/sec    1.00    499.4±1.73ms  1092.0 MB/sec
list_primitive_non_null/bloom_filter               1.00    353.5±3.96ms  1539.4 MB/sec    1.11    391.1±4.55ms  1391.7 MB/sec
list_primitive_non_null/cdc                        1.00    430.5±7.33ms  1264.3 MB/sec    1.02    438.6±6.91ms  1240.8 MB/sec
list_primitive_non_null/default                    1.00    227.2±3.56ms     2.3 GB/sec    1.15    261.0±3.70ms     2.0 GB/sec
list_primitive_non_null/parquet_2                  1.00    272.6±0.47ms  1996.7 MB/sec    1.08    293.1±1.49ms  1856.9 MB/sec
list_primitive_non_null/zstd                       1.00    660.4±2.52ms   824.2 MB/sec    1.04    683.5±4.19ms   796.3 MB/sec
list_primitive_non_null/zstd_parquet_2             1.00    669.5±2.18ms   812.9 MB/sec    1.00    668.9±2.05ms   813.7 MB/sec
list_primitive_sparse_99pct_null/bloom_filter      1.00     11.5±0.09ms     3.2 GB/sec    1.00     11.5±0.06ms     3.2 GB/sec
list_primitive_sparse_99pct_null/cdc               1.00     22.8±0.11ms  1639.1 MB/sec    1.01     23.0±0.11ms  1624.7 MB/sec
list_primitive_sparse_99pct_null/default           1.00     10.9±0.04ms     3.4 GB/sec    1.03     11.2±0.06ms     3.3 GB/sec
list_primitive_sparse_99pct_null/parquet_2         1.00     11.1±0.07ms     3.3 GB/sec    1.01     11.2±0.04ms     3.3 GB/sec
list_primitive_sparse_99pct_null/zstd              1.00     12.8±0.08ms     2.9 GB/sec    1.02     13.0±0.07ms     2.8 GB/sec
list_primitive_sparse_99pct_null/zstd_parquet_2    1.00     11.4±0.06ms     3.2 GB/sec    1.00     11.4±0.06ms     3.2 GB/sec
primitive/bloom_filter                             1.03    153.4±0.54ms   292.6 MB/sec    1.00    149.2±0.53ms   300.8 MB/sec
primitive/cdc                                      1.01    160.1±0.66ms   280.4 MB/sec    1.00    158.7±0.53ms   282.8 MB/sec
primitive/default                                  1.01    118.9±0.38ms   377.3 MB/sec    1.00    117.9±0.36ms   380.6 MB/sec
primitive/parquet_2                                1.01    134.1±0.46ms   334.5 MB/sec    1.00    132.8±0.34ms   338.0 MB/sec
primitive/zstd                                     1.01    149.3±0.38ms   300.6 MB/sec    1.00    147.3±0.38ms   304.6 MB/sec
primitive/zstd_parquet_2                           1.01    167.4±0.33ms   268.1 MB/sec    1.00    166.2±0.42ms   270.1 MB/sec
primitive_all_null/bloom_filter                    1.00    837.9±3.55µs    52.3 GB/sec    1.07    893.2±2.37µs    49.1 GB/sec
primitive_all_null/cdc                             1.00     19.6±0.29ms     2.2 GB/sec    1.00     19.6±0.33ms     2.2 GB/sec
primitive_all_null/default                         1.00    239.4±1.19µs   183.1 GB/sec    1.16    278.1±0.78µs   157.6 GB/sec
primitive_all_null/parquet_2                       1.00    237.9±1.02µs   184.2 GB/sec    1.18    279.8±1.28µs   156.6 GB/sec
primitive_all_null/zstd                            1.00    354.1±1.63µs   123.8 GB/sec    1.10    390.5±0.78µs   112.2 GB/sec
primitive_all_null/zstd_parquet_2                  1.00    314.7±1.07µs   139.2 GB/sec    1.14    357.3±1.24µs   122.7 GB/sec
primitive_non_null/bloom_filter                    1.04    111.3±0.53ms   395.5 MB/sec    1.00    106.6±0.36ms   412.7 MB/sec
primitive_non_null/cdc                             1.01     91.4±0.27ms   481.3 MB/sec    1.00     90.1±0.32ms   488.1 MB/sec
primitive_non_null/default                         1.01     68.6±0.51ms   641.7 MB/sec    1.00     67.6±0.29ms   651.3 MB/sec
primitive_non_null/parquet_2                       1.02     90.8±0.33ms   484.6 MB/sec    1.00     89.3±0.24ms   493.0 MB/sec
primitive_non_null/zstd                            1.00    101.8±6.70ms   432.2 MB/sec    1.03    104.7±1.06ms   420.4 MB/sec
primitive_non_null/zstd_parquet_2                  1.00    124.6±0.30ms   353.1 MB/sec    1.03    128.7±2.53ms   341.8 MB/sec
primitive_sparse_99pct_null/bloom_filter           1.01     18.6±0.21ms     2.4 GB/sec    1.00     18.4±0.21ms     2.4 GB/sec
primitive_sparse_99pct_null/cdc                    1.00     37.0±0.28ms  1212.7 MB/sec    1.01     37.3±0.24ms  1204.0 MB/sec
primitive_sparse_99pct_null/default                1.00     16.6±0.08ms     2.6 GB/sec    1.01     16.8±0.09ms     2.6 GB/sec
primitive_sparse_99pct_null/parquet_2              1.00     16.8±0.06ms     2.6 GB/sec    1.00     16.8±0.08ms     2.6 GB/sec
primitive_sparse_99pct_null/zstd                   1.00     20.0±0.06ms     2.2 GB/sec    1.01     20.2±0.10ms     2.2 GB/sec
primitive_sparse_99pct_null/zstd_parquet_2         1.00     18.6±0.10ms     2.4 GB/sec    1.01     18.7±0.07ms     2.3 GB/sec
string/bloom_filter                                1.00    171.1±0.74ms     3.0 GB/sec    1.30   222.4±21.59ms     2.3 GB/sec
string/cdc                                         1.00    215.6±4.34ms     2.4 GB/sec    1.03    221.5±4.69ms     2.3 GB/sec
string/default                                     1.00     91.1±3.12ms     5.6 GB/sec    1.39   126.4±20.53ms     4.1 GB/sec
string/parquet_2                                   1.00    104.7±0.79ms     4.9 GB/sec    1.07    112.4±6.67ms     4.6 GB/sec
string/zstd                                        1.00    405.8±1.13ms  1291.8 MB/sec    1.03    417.7±2.20ms  1254.9 MB/sec
string/zstd_parquet_2                              1.00    395.6±1.27ms  1325.2 MB/sec    1.02    403.6±6.87ms  1299.0 MB/sec
string_and_binary_view/bloom_filter                1.00     64.3±0.28ms   501.5 MB/sec    1.02     65.6±0.23ms   491.9 MB/sec
string_and_binary_view/cdc                         1.00     58.0±0.13ms   556.3 MB/sec    1.01     58.3±0.19ms   553.2 MB/sec
string_and_binary_view/default                     1.00     47.6±0.13ms   677.3 MB/sec    1.01     48.2±0.12ms   669.2 MB/sec
string_and_binary_view/parquet_2                   1.00     58.4±0.14ms   552.4 MB/sec    1.01     59.0±0.13ms   546.5 MB/sec
string_and_binary_view/zstd                        1.00     84.4±0.15ms   382.0 MB/sec    1.00     84.7±0.14ms   380.6 MB/sec
string_and_binary_view/zstd_parquet_2              1.00     72.3±0.13ms   446.0 MB/sec    1.00     72.6±0.15ms   444.3 MB/sec
string_dictionary/bloom_filter                     1.00     85.7±0.31ms     3.0 GB/sec    1.06     90.9±0.99ms     2.8 GB/sec
string_dictionary/cdc                              1.00     50.7±0.35ms     5.1 GB/sec    1.04     52.9±0.94ms     4.9 GB/sec
string_dictionary/default                          1.00     45.6±0.20ms     5.7 GB/sec    1.03     46.8±0.79ms     5.5 GB/sec
string_dictionary/parquet_2                        1.00     53.5±0.55ms     4.8 GB/sec    1.02     54.5±0.26ms     4.7 GB/sec
string_dictionary/zstd                             1.00    202.5±0.43ms  1304.3 MB/sec    1.04    209.7±1.56ms  1259.3 MB/sec
string_dictionary/zstd_parquet_2                   1.00    198.8±0.18ms  1328.8 MB/sec    1.00    199.1±0.24ms  1326.7 MB/sec
string_non_null/bloom_filter                       1.00    205.7±1.08ms     2.5 GB/sec    1.23   252.0±14.04ms     2.0 GB/sec
string_non_null/cdc                                1.00    250.3±2.96ms     2.0 GB/sec    1.06    266.1±2.67ms  1968.9 MB/sec
string_non_null/default                            1.00     98.6±0.20ms     5.2 GB/sec    1.40   138.1±12.87ms     3.7 GB/sec
string_non_null/parquet_2                          1.00    120.8±0.64ms     4.2 GB/sec    1.09    131.5±3.36ms     3.9 GB/sec
string_non_null/zstd                               1.00    517.4±3.24ms  1012.8 MB/sec    1.04    536.7±2.44ms   976.3 MB/sec
string_non_null/zstd_parquet_2                     1.00    505.2±1.31ms  1037.1 MB/sec    1.00    506.2±1.35ms  1035.2 MB/sec
struct_all_null/bloom_filter                       1.00    348.6±1.95µs    45.2 GB/sec    1.07    374.1±0.96µs    42.1 GB/sec
struct_all_null/cdc                                1.01      8.4±0.21ms  1918.8 MB/sec    1.00      8.3±0.14ms  1944.0 MB/sec
struct_all_null/default                            1.00    101.0±0.47µs   155.9 GB/sec    1.19    120.4±0.25µs   130.7 GB/sec
struct_all_null/parquet_2                          1.00    100.0±0.45µs   157.5 GB/sec    1.21    121.0±0.34µs   130.1 GB/sec
struct_all_null/zstd                               1.00    150.9±0.54µs   104.3 GB/sec    1.12    168.7±0.32µs    93.4 GB/sec
struct_all_null/zstd_parquet_2                     1.00    132.8±0.41µs   118.5 GB/sec    1.16    154.2±0.58µs   102.1 GB/sec
struct_non_null/bloom_filter                       1.00     46.5±0.16ms   344.1 MB/sec    1.00     46.4±0.20ms   344.6 MB/sec
struct_non_null/cdc                                1.01     45.7±0.15ms   350.1 MB/sec    1.00     45.3±0.15ms   353.0 MB/sec
struct_non_null/default                            1.01     32.1±0.10ms   497.9 MB/sec    1.00     31.8±0.11ms   503.7 MB/sec
struct_non_null/parquet_2                          1.02     41.1±0.13ms   389.0 MB/sec    1.00     40.5±0.10ms   395.0 MB/sec
struct_non_null/zstd                               1.00     41.2±0.14ms   388.5 MB/sec    1.00     41.1±0.10ms   388.9 MB/sec
struct_non_null/zstd_parquet_2                     1.01     55.0±0.15ms   290.7 MB/sec    1.00     54.7±0.13ms   292.3 MB/sec
struct_sparse_99pct_null/bloom_filter              1.00      6.4±0.07ms     2.5 GB/sec    1.00      6.4±0.03ms     2.5 GB/sec
struct_sparse_99pct_null/cdc                       1.00     14.2±0.12ms  1135.7 MB/sec    1.01     14.3±0.10ms  1125.3 MB/sec
struct_sparse_99pct_null/default                   1.00      5.8±0.02ms     2.7 GB/sec    1.02      5.9±0.02ms     2.7 GB/sec
struct_sparse_99pct_null/parquet_2                 1.00      5.7±0.03ms     2.7 GB/sec    1.03      5.9±0.02ms     2.7 GB/sec
struct_sparse_99pct_null/zstd                      1.00      7.2±0.04ms     2.2 GB/sec    1.02      7.3±0.03ms     2.2 GB/sec
struct_sparse_99pct_null/zstd_parquet_2            1.00      6.5±0.03ms     2.4 GB/sec    1.02      6.7±0.02ms     2.4 GB/sec
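
To scan a table like the one above for rows where the two sides differ noticeably, a small parser is enough. This is a sketch that assumes the exact row layout shown here (name, then for each side a relative factor, a `time±error` token, and a throughput value with its unit); `Row`, `parse_row`, and `slower_on_main` are hypothetical names, not part of the bench bot.

```rust
/// One row of the comparison table: benchmark name plus the relative-time
/// factor printed for each side (1.00 marks the faster side).
#[derive(Debug, PartialEq)]
struct Row {
    name: String,
    branch_factor: f64,
    main_factor: f64,
}

/// Parse a row laid out as:
/// name  factor  time±err  tput unit  factor  time±err  tput unit
/// Header and separator lines do not match this shape and yield None.
fn parse_row(line: &str) -> Option<Row> {
    let tokens: Vec<&str> = line.split_whitespace().collect();
    if tokens.len() != 9 {
        return None;
    }
    Some(Row {
        name: tokens[0].to_string(),
        branch_factor: tokens[1].parse().ok()?,
        main_factor: tokens[5].parse().ok()?,
    })
}

/// Names of benchmarks where `main` is at least `threshold` times slower
/// than the branch.
fn slower_on_main(table: &str, threshold: f64) -> Vec<String> {
    table
        .lines()
        .filter_map(parse_row)
        .filter(|row| row.main_factor / row.branch_factor >= threshold)
        .map(|row| row.name)
        .collect()
}
```

For example, the `string/default` row parses to a branch factor of 1.00 and a main factor of 1.39, so it is reported for any threshold up to 1.39.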

Resource Usage

base (merge-base)

Metric         Value
------         -----
Wall time      1920.4s
Peak memory    6.7 GiB
Avg memory     6.6 GiB
CPU user       1885.3s
CPU sys        32.6s
Peak spill     0 B

branch

Metric         Value
------         -----
Wall time      1880.4s
Peak memory    21.5 GiB
Avg memory     14.7 GiB
CPU user       1873.2s
CPU sys        5.0s
Peak spill     0 B

Contributor

@alamb alamb left a comment

Thanks @adriangb -- I think this PR is good proof that some of the benchmarks are sensitive to allocation / deallocation order and allocator state.

However, I don't think we should merge this, as it will make interpreting other benchmark results hard (e.g. there is even more chance of divergence between Linux and macOS).

#[macro_use]
extern crate criterion;

// Use jemalloc, with page decay disabled, for the writer benchmarks.
Contributor

I think this might be overfitting -- using jemalloc (vs other allocators)

I don't have any way to really evaluate the implications of switching allocators and all these jemalloc tuning knobs. I fear they will have other hard-to-understand side effects.

I am happy to have us conclude that any particular benchmark run is allocation page heavy and thus we can ignore the effects

Contributor Author

The reason to use jemalloc is not the allocator itself, but rather that it is very configurable, and we need that configuration.

@adriangb
Contributor Author

adriangb commented Jun 4, 2026

However, I don't think we should merge this, as it will make interpreting other benchmark results hard (e.g. there is even more chance of divergence between Linux and macOS).

Hmm, I don't think comparing raw numbers (or even stability) on Linux vs. macOS is something anyone should be doing anyway; this just makes intra-platform runs stable.

But happy to close this if you don't think it's the right path forward, it can always be resurrected.

@adriangb adriangb closed this Jun 4, 2026
Labels

parquet Changes to the parquet crate

3 participants