bench(parquet): make arrow_writer benchmarks allocation-order stable #10068
adriangb wants to merge 1 commit into
run benchmark arrow_writer
🤖 Arrow criterion benchmark running (GKE). Comparing bench-arrow-writer-stable-allocator (e2c785a) to fd1c5b3 (merge-base).
The `arrow_writer` benchmarks rebuild a fresh `ArrowWriter` every criterion iteration, so the writer's internal encode buffers are allocated and freed each iteration. With a page-decaying allocator those buffers are served from fresh (un-faulted) pages whenever earlier benchmarks in the same process have churned the heap, so every iteration pays a per-page minor fault. That roughly doubles the measured time for the byte-array writers and makes the result depend on benchmark order: e.g. `string/parquet_2` swings between ~106ms and ~190ms with no code change, purely on what ran before it. This is the source of the spurious ~1.2-1.75x deltas seen on the criterion bench bot (a main-vs-main run reproduced an 18% delta).

Use jemalloc as the bench's global allocator with page decay disabled (`dirty_decay_ms:-1,muzzy_decay_ms:-1`), so freed pages stay mapped and are reused warm instead of being returned to the OS. This removes the per-iteration fault tax and collapses the bimodality (stable ~106ms across orderings here).

The decay policy is pinned via a compiled-in `malloc_conf` symbol rather than an allocator default, and `assert_page_decay_disabled` checks at startup (via `tikv-jemalloc-ctl`) that it actually took effect, so a silently-ignored config fails loudly instead of quietly reintroducing the instability.

Gated to Linux: jemalloc does not build on some targets (e.g. wasm, msvc) and its unprefixed `malloc_conf` symbol is not honored on others (e.g. apple, android); elsewhere the bench falls back to the default allocator. Linux is where the canonical benchmark runner runs.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
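To make the fault-tax argument concrete, here is a toy model of the mechanism described above. It is not jemalloc and not the PR's code: `ToyHeap`, `alloc_and_touch`, `free`, and `simulate` are hypothetical names, and "decay" is simplified to "every freed page is returned to the OS immediately", whereas real decay is time-based and depends on what else has churned the heap.

```rust
use std::collections::HashSet;

/// Toy model of an allocator's page cache (hypothetical, for illustration).
struct ToyHeap {
    /// Pages that are mapped and already faulted in ("warm").
    warm: HashSet<usize>,
    /// Simplifying assumption: with decay on, freeing returns all pages to the OS.
    decay: bool,
    /// Number of simulated minor page faults.
    faults: u64,
}

impl ToyHeap {
    fn new(decay: bool) -> Self {
        ToyHeap { warm: HashSet::new(), decay, faults: 0 }
    }

    /// Simulate allocating a buffer of `pages` pages and writing to each one.
    fn alloc_and_touch(&mut self, pages: usize) {
        for page in 0..pages {
            // The first write to a page that is not warm costs one minor fault.
            if self.warm.insert(page) {
                self.faults += 1;
            }
        }
    }

    /// Simulate freeing the buffer at the end of an iteration.
    fn free(&mut self) {
        if self.decay {
            // Pages go back to the OS, so the next iteration starts cold.
            self.warm.clear();
        }
        // With decay disabled the pages stay mapped and are reused warm.
    }
}

/// Run `iterations` alloc/write/free cycles and return the total fault count.
fn simulate(decay: bool, iterations: u64, pages_per_iter: usize) -> u64 {
    let mut heap = ToyHeap::new(decay);
    for _ in 0..iterations {
        heap.alloc_and_touch(pages_per_iter);
        heap.free();
    }
    heap.faults
}

fn main() {
    let (iterations, pages) = (100, 256);
    println!("decay on:  {} faults", simulate(true, iterations, pages));
    println!("decay off: {} faults", simulate(false, iterations, pages));
}
```

In this model the decaying heap pays for every page on every iteration, while the non-decaying heap pays only on the first. The real benchmark sits somewhere between those extremes depending on allocator state, which is consistent with the order dependence the commit message describes.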
e2c785a to f946549
🤖 Arrow criterion benchmark completed (GKE)
alamb left a comment
Thanks @adriangb -- I think this PR is good proof that some of the benchmarks are sensitive to allocation / deallocation order and allocator state.
However, I don't think we should merge this, as it will make interpreting other benchmark results hard (e.g. there is even more chance of divergence between Linux and macOS).
#[macro_use]
extern crate criterion;

// Use jemalloc, with page decay disabled, for the writer benchmarks.
I think this might be overfitting by using jemalloc (vs. other allocators).
I don't have any way to really evaluate the implications of switching allocators and of all these jemalloc tuning knobs. I fear they will have other hard-to-understand side effects.
I am happy to have us conclude that a particular benchmark run is allocation/page heavy and thus that we can ignore the effects.
The reason to use jemalloc is not the allocator itself; rather, it's very configurable and we need that configuration.
Hmm, I don't think comparing raw numbers (or even stability) on Linux vs. macOS is something anyone should be doing anyway; this just makes intra-platform runs stable. But I'm happy to close this if you don't think it's the right path forward; it can always be resurrected.
Which issue does this PR close?

Follow-up to benchmark noise observed on the criterion bench bot (e.g. on #9972), where `string/parquet_2` reported a ~1.75x "regression" that was not reproducible and not present in instruction counts.

Rationale for this change

The `arrow_writer` benchmarks build a fresh `ArrowWriter` every criterion iteration, so the writer's internal encode buffers are allocated and freed on each iteration. With a page-decaying allocator (glibc default, jemalloc default), those buffers are served from fresh, un-faulted pages whenever earlier benchmarks in the same process have churned the heap, so each iteration pays a minor page fault for every page it writes to.

That fault tax roughly doubles the measured time for the byte-array writers and makes the result depend on benchmark order. On the same hardware as the bench bot (Neoverse-V2), the same `main` binary produces:

[Results table not recoverable from this copy; surviving labels: `string/parquet_2`, `primitive` group.]

This is the source of the spurious bench-bot deltas: a `main`-vs-`main` control run (identical code on both sides) reproduced an 18% difference on `string/parquet_2`, and a larger draw produced the original ~1.75x. The work done is identical (instruction count differs by ~0.25% for the change that triggered the investigation); only the page-fault state differs.

Diagnosis details: the slow basin shows ~5M minor faults vs ~763K in the fast basin; forcing every buffer onto fresh pages (`MALLOC_MMAP_THRESHOLD_` low) pins it slow, and disabling page decay pins it fast.

What changes are included in this PR?

Use jemalloc as the `arrow_writer` bench's global allocator with page decay disabled (`dirty_decay_ms:-1,muzzy_decay_ms:-1`), so freed pages stay mapped and are reused warm instead of being returned to the OS. This removes the per-iteration fault tax and collapses the order-dependent bimodality:

[Results table not recoverable from this copy; surviving labels: `string/parquet_2`, `primitive`, `string` group.]

Notes on robustness (this came up in review):

- The decay policy is pinned via a compiled-in `malloc_conf` symbol, so it does not silently change if the allocator updates its defaults.
- The `malloc_conf` symbol is only honored when built with `unprefixed_malloc_on_supported_platforms`; without it the symbol is silently ignored. To make that failure mode loud, `assert_page_decay_disabled()` reads `opt.dirty_decay_ms` / `opt.muzzy_decay_ms` at startup (via `tikv-jemalloc-ctl`) and panics if the policy is not actually `-1`, with a hint. This was verified to fire when the feature is removed.

Scope: the allocator only affects the `arrow_writer` benchmark binary; no library code changes.

Are there any user-facing changes?

No. Benchmark-only change (dev-dependencies + the `arrow_writer` bench).
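The fail-loud startup check described in the robustness notes can be sketched roughly as follows. This is an illustration, not the PR's implementation: the thread does not show how the values are read through `tikv-jemalloc-ctl`, so the lookup is abstracted behind a caller-supplied `read_opt` function, and `check_page_decay_disabled`, `pinned_reader`, and `decaying_reader` are hypothetical names. Only the option names and the expected value `-1` come from the description above.

```rust
/// Value expected for both decay options when page decay is disabled.
const DECAY_DISABLED: isize = -1;

/// Check both decay options through `read_opt`, a stand-in (assumption) for
/// whatever allocator-introspection call the real bench uses.
fn check_page_decay_disabled<F>(read_opt: F) -> Result<(), String>
where
    F: Fn(&str) -> Option<isize>,
{
    for name in ["opt.dirty_decay_ms", "opt.muzzy_decay_ms"] {
        match read_opt(name) {
            Some(DECAY_DISABLED) => {}
            Some(other) => {
                return Err(format!(
                    "{} is {}, expected -1; was the `malloc_conf` symbol ignored?",
                    name, other
                ));
            }
            None => return Err(format!("could not read {}", name)),
        }
    }
    Ok(())
}

/// Panicking wrapper, so a silently ignored config aborts the bench at startup.
fn assert_page_decay_disabled<F>(read_opt: F)
where
    F: Fn(&str) -> Option<isize>,
{
    if let Err(msg) = check_page_decay_disabled(read_opt) {
        panic!("{}", msg);
    }
}

/// Fake reader: behaves as if the compiled-in config took effect.
fn pinned_reader(_name: &str) -> Option<isize> {
    Some(-1)
}

/// Fake reader: behaves as if the config was ignored (illustrative values only).
fn decaying_reader(name: &str) -> Option<isize> {
    match name {
        "opt.dirty_decay_ms" => Some(10_000),
        _ => Some(0),
    }
}

fn main() {
    // Passes silently when both options report -1.
    assert_page_decay_disabled(pinned_reader);
    // Reports which option is wrong when the policy did not take effect.
    println!("{:?}", check_page_decay_disabled(decaying_reader));
}
```

Separating the `Result`-returning check from the panicking wrapper keeps the logic testable without a real allocator; in the bench itself, only the wrapper would be called once before any benchmark runs.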