perf(parquet): eliminate per-value allocation in delta bit-pack decoder #730
Open
zeroshade wants to merge 1 commit into apache:main from
Optimize deltaBitPackDecoder.unpackNextMini() in two ways:
1. Replace per-value GetValue() calls with GetBatch() using a reused single-element buffer (deltaBuf field). GetValue() allocated a new []uint64 on every call, causing ~2048 heap allocations per 1024-value page.
2. Add a fast path for width=0 miniblocks (all deltas identical) that skips bit reading entirely and accumulates minDelta directly.
These changes yield ~4x faster delta binary packed decoding.
5aa10f0 to 170e2f8
lidavidm approved these changes on Mar 29, 2026
Rationale for this change
The delta bit-pack decoder's unpackNextMini() method calls BitReader.GetValue() once per value in each miniblock. GetValue() allocates a fresh []uint64 slice on every call.

For the default block size of 128 with 4 miniblocks of 32 values each, this causes ~128 heap allocations per block, or ~2048 allocations per 1024-value page. This allocation pressure dominates the decoder's runtime and generates significant GC load.
What changes are included in this PR?
1. Replace the per-value GetValue() calls with GetBatch() reading into a deltaBuf []uint64 field added to the decoder struct; the buffer is allocated once and reused across calls, eliminating the per-value allocations.
2. When deltaBitWidth == 0 (all deltas are identical, common for sequential or constant data), skip the bit reading entirely and directly accumulate minDelta.
Are these changes tested?
Yes, the existing test suite passes, along with all encoding and property tests.
Are there any user-facing changes?
No user-facing API changes; this is purely an internal optimization.
Benchmark Results (darwin/arm64, Apple M4, Go 1.25)
Baseline (main):
Optimized (this PR):
Summary: