parquet: add Decoder::scan_filtered for miniblock-level predicate pushdown #9788
Open
sahuagin wants to merge 1 commit into apache:main
…te pushdown

Adds scan_filtered(num_values, out, predicate) as a provided method on the Decoder trait. The default implementation ignores the predicate and decodes everything, which is a safe fallback for all encodings without per-region metadata.

DeltaBitPackDecoder overrides it to compute a conservative [lo, hi] range per miniblock from last_value, min_delta, bit_width, and miniblock value count. If the predicate rejects the range the miniblock is skipped without decoding:

- bw=0: arithmetic advancement of last_value, no bit reads
- terminal bw>0: BitReader::skip, no decode
- mid-stream bw>0: decode into scratch to maintain last_value accuracy

Returns (values_emitted, values_consumed).

Benchmarks vs upstream HEAD: scan_filtered on 1M-row monotone DELTA column: 1.96ms -> 470us (4.2x)

Split from apache#9769 as requested by reviewer.
Note on benchmark variance: These results were collected on a non-isolated machine without CPU frequency pinning. Small variances of ±5% on non-bw=0 paths (particularly |
Closes #9785
Adds `scan_filtered(num_values, out, predicate)` as a provided method on the `Decoder` trait. The method scans up to `num_values`, appending to `out` only values from regions where `predicate(lo, hi)` returns `true`.

Default implementation (all encodings): ignores the predicate and decodes everything. This is a safe fallback with no behavioral change for existing decoders.
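
To illustrate why a provided method with a predicate-ignoring default is non-breaking, here is a minimal sketch. `ToyDecoder`, `PlainToy`, and the exact parameter types are hypothetical stand-ins chosen for illustration; they are not the real parquet `Decoder` trait or the signature used in this PR.

```rust
/// Simplified stand-in for a decoder trait (NOT the real parquet `Decoder`).
trait ToyDecoder {
    /// Decode up to `out.len()` values; returns how many were written.
    fn get(&mut self, out: &mut [i64]) -> usize;

    /// Provided method: the default ignores `_predicate` and decodes everything.
    /// Returns (values_emitted, values_consumed).
    fn scan_filtered(
        &mut self,
        num_values: usize,
        out: &mut Vec<i64>,
        _predicate: &mut dyn FnMut(i64, i64) -> bool,
    ) -> (usize, usize) {
        let start = out.len();
        out.resize(start + num_values, 0);
        let n = self.get(&mut out[start..]);
        out.truncate(start + n);
        // Nothing is skipped, so emitted == consumed.
        (n, n)
    }
}

/// Hypothetical PLAIN-like decoder over an in-memory buffer.
struct PlainToy {
    data: Vec<i64>,
    pos: usize,
}

impl ToyDecoder for PlainToy {
    fn get(&mut self, out: &mut [i64]) -> usize {
        let n = out.len().min(self.data.len() - self.pos);
        out[..n].copy_from_slice(&self.data[self.pos..self.pos + n]);
        self.pos += n;
        n
    }
    // No `scan_filtered` override: the default applies unchanged.
}
```

In this sketch an existing implementor such as `PlainToy` compiles without modification and emits every value even when the predicate rejects everything, which is the conservative behavior described above.
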
`DeltaBitPackDecoder` override: computes a conservative `[lo, hi]` range per miniblock from `last_value`, `min_delta`, `bit_width`, and miniblock value count. If the predicate rejects the range, the miniblock is skipped without decoding individual values. Three skip strategies depending on context:

- bw=0: arithmetic advancement of `last_value`, no bit reads.
- terminal bw>0: `BitReader::skip`, no decode.
- mid-stream bw>0: decode into scratch buffer to maintain `last_value` accuracy for subsequent miniblock range checks.
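
One way such a conservative range could be derived is sketched below. This is an assumption about the arithmetic, not the code in this PR: each value in a DELTA_BINARY_PACKED miniblock equals the previous value plus `min_delta` plus a packed delta in `[0, 2^bit_width - 1]`, so the per-step delta is bounded and the bounds can be accumulated over the miniblock. The function names are hypothetical, and falling back to the full `i64` range on overflow is a choice made here to stay conservative.

```rust
/// Conservative [lo, hi] bounds for the next `n` values of a miniblock.
/// Hypothetical helper; the PR's actual computation may differ.
fn miniblock_range(last_value: i64, min_delta: i64, bit_width: u32, n: usize) -> (i64, i64) {
    if n == 0 {
        // No values to bound; return a degenerate range.
        return (last_value, last_value);
    }
    // Largest packed delta representable in `bit_width` bits.
    let max_packed: i128 = if bit_width >= 64 {
        u64::MAX as i128
    } else {
        ((1u64 << bit_width) - 1) as i128
    };
    let d_min = min_delta as i128;
    let d_max = d_min + max_packed;
    let last = last_value as i128;
    let n = n as i128;

    // After k steps (1..=n) the value lies in [last + k*d_min, last + k*d_max].
    let lo = if d_min >= 0 {
        last.saturating_add(d_min)
    } else {
        last.saturating_add(n.saturating_mul(d_min))
    };
    let hi = if d_max >= 0 {
        last.saturating_add(n.saturating_mul(d_max))
    } else {
        last.saturating_add(d_max)
    };

    // If the bounds leave the i64 range, give up and report "anything".
    match (i64::try_from(lo), i64::try_from(hi)) {
        (Ok(lo), Ok(hi)) => (lo, hi),
        _ => (i64::MIN, i64::MAX),
    }
}

/// bw=0 skip: every delta equals `min_delta`, so `last_value` can be
/// advanced arithmetically without reading any bits.
fn skip_constant_miniblock(last_value: i64, min_delta: i64, n: usize) -> i64 {
    last_value.wrapping_add(min_delta.wrapping_mul(n as i64))
}
```

The bw=0 case shows why the arithmetic skip is cheap: with a zero bit width the range collapses to an exact arithmetic progression, and the new `last_value` is a single multiply-add.
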
The predicate contract is conservative: `false` means the region definitely cannot match (safe to skip); `true` means it might match (decode and emit). False positives are safe; false negatives are not permitted.
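
As an illustration of this contract, the sketch below builds range predicates for two common row filters. The helper names are hypothetical and the `Fn(i64, i64) -> bool` shape is assumed from the `predicate(lo, hi)` description; each closure returns `false` only when no value in `[lo, hi]` could satisfy the filter.

```rust
/// Range predicate for the row filter `x >= threshold`.
/// The region can only be rejected if even its upper bound is too small.
fn ge_predicate(threshold: i64) -> impl Fn(i64, i64) -> bool {
    move |_lo, hi| hi >= threshold
}

/// Range predicate for the row filter `a <= x && x <= b`.
/// The region might match whenever [lo, hi] overlaps [a, b].
fn between_predicate(a: i64, b: i64) -> impl Fn(i64, i64) -> bool {
    move |lo, hi| lo <= b && hi >= a
}
```

Because a `true` result only means the region might match, values emitted from an accepted region would still need the exact row-level filter applied afterwards.
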
Benchmarks (`arrow_reader` bench vs upstream HEAD, combined with #9786 and #9787):

Note: `scan_filtered` in isolation shows smaller gains since it does not have the bw=0 (#9786) and terminal-skip (#9787) optimizations underneath it. The numbers above reflect the combined state, which is the intended deployment.

Benchmarks were run on a non-isolated machine (no CPU frequency pinning); small variances of ±5% on non-bw=0 paths should be attributed to measurement noise.
Tests added:

get()).

Note on API surface: `scan_filtered` is a provided method with a safe default, so adding it is non-breaking. Encodings that don't have per-region metadata (PLAIN, RLE, etc.) get the correct conservative behavior for free.
Generated-by: Claude (claude-sonnet-4-6)