[PoC] Filter pushdown selectivity threshold #9414
Draft
Dandandan wants to merge 7 commits into apache:main from
Conversation
Previously, every predicate in the RowFilter received the same ProjectionMask containing ALL filter columns. This caused unnecessary decoding of expensive string columns when evaluating cheap integer predicates. Now each predicate receives a mask with only the single column it needs.

Key sync improvements (vs baseline):

- Q37: 63.7ms -> 7.3ms (-88.6%, Title LIKE with CounterID=62 filter)
- Q36: 117ms -> 24ms (-79.5%, URL <> '' with CounterID=62 filter)
- Q40: 17.9ms -> 5.1ms (-71.5%, multi-pred with RefererHash eq)
- Q41: 17.3ms -> 5.5ms (-68.1%, multi-pred with URLHash eq)
- Q22: 303ms -> 127ms (-58.2%, 3 string predicates)
- Q42: 7.6ms -> 3.9ms (-48.5%, int-only multi-predicate)
- Q38: 19.1ms -> 12.4ms (-34.9%, 5 int predicates)
- Q21: 159ms -> 98ms (-38.5%, URL LIKE + SearchPhrase)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
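A minimal sketch of what per-predicate masks look like with the parquet crate's `RowFilter` / `ArrowPredicateFn` API. The function name `q36_row_filter` and the leaf indices (0 for CounterID, 1 for URL) are placeholders chosen for illustration, not the code in this commit; the predicates mirror the Q36 shape (`CounterID = 62 AND URL <> ''`) mentioned above.

```rust
use arrow::array::{AsArray, Int32Array, StringArray};
use arrow::compute::kernels::cmp::{eq, neq};
use arrow::datatypes::Int32Type;
use parquet::arrow::arrow_reader::{ArrowPredicate, ArrowPredicateFn, RowFilter};
use parquet::arrow::ProjectionMask;
use parquet::schema::types::SchemaDescriptor;

/// Build a RowFilter in which each predicate projects only the single column
/// it reads, instead of one shared mask containing every filter column.
fn q36_row_filter(schema: &SchemaDescriptor) -> RowFilter {
    // Cheap integer predicate: CounterID = 62. Only the Int32 column is decoded.
    let counter_pred = ArrowPredicateFn::new(ProjectionMask::leaves(schema, [0]), |batch| {
        eq(
            batch.column(0).as_primitive::<Int32Type>(),
            &Int32Array::new_scalar(62),
        )
    });

    // More expensive string predicate: URL <> ''. Only the URL column is read,
    // and only for rows that survived the previous predicate.
    let url_pred = ArrowPredicateFn::new(ProjectionMask::leaves(schema, [1]), |batch| {
        neq(
            batch.column(0).as_string::<i32>(),
            &StringArray::new_scalar(""),
        )
    });

    let predicates: Vec<Box<dyn ArrowPredicate>> =
        vec![Box::new(counter_pred), Box::new(url_pred)];
    RowFilter::new(predicates)
}
```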
Use page-level min/max statistics (via StatisticsConverter) to compute a RowSelection that skips pages where equality predicates cannot match. For each equality predicate with an integer literal, check whether the literal falls within each page's [min, max] range and skip the pages where it doesn't.

Impact is data-dependent: it is most effective when the data is sorted or clustered by the filter column. For this particular 100K-row sample file the data isn't sorted by the filter columns, so improvements are modest (~5% for some CounterID=62 queries); gains would be larger on sorted datasets.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
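A rough sketch of the pruning step described here, assuming the per-page min/max values and row counts have already been pulled out of the page index (e.g. via StatisticsConverter). The `PageStats` struct and `prune_pages` function are illustrative names, not APIs added by this commit.

```rust
use parquet::arrow::arrow_reader::{RowSelection, RowSelector};

/// Hypothetical per-page statistics for one column, gathered elsewhere
/// from the Parquet page index.
struct PageStats {
    min: i64,
    max: i64,
    row_count: usize,
}

/// Skip every page whose [min, max] range cannot contain the equality literal.
fn prune_pages(pages: &[PageStats], literal: i64) -> RowSelection {
    let selectors: Vec<RowSelector> = pages
        .iter()
        .map(|page| {
            if literal < page.min || literal > page.max {
                RowSelector::skip(page.row_count)
            } else {
                RowSelector::select(page.row_count)
            }
        })
        .collect();
    RowSelection::from(selectors)
}
```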
Put the cheapest/most selective predicate first: SearchPhrase <> '' filters out ~87% of the rows before the expensive LIKE predicates run, which significantly reduces string column decoding for Title and URL. Q22 sync: ~6% improvement; Q22 async: ~13% improvement.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
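Since RowFilter evaluates its predicates in order, the reordering amounts to placing the cheap, highly selective check first in the Vec handed to `RowFilter::new`. A sketch in the same style as above, with a plain substring check standing in for the actual LIKE kernel; the function name and leaf indices are again placeholders.

```rust
use arrow::array::{AsArray, BooleanArray, StringArray};
use arrow::compute::kernels::cmp::neq;
use parquet::arrow::arrow_reader::{ArrowPredicate, ArrowPredicateFn, RowFilter};
use parquet::arrow::ProjectionMask;
use parquet::schema::types::SchemaDescriptor;

/// Q22-style filter: the cheap, highly selective SearchPhrase <> '' check runs
/// first, so the expensive Title predicate only sees the rows that survive it.
fn q22_row_filter(schema: &SchemaDescriptor) -> RowFilter {
    let search_phrase_pred = ArrowPredicateFn::new(ProjectionMask::leaves(schema, [0]), |batch| {
        neq(
            batch.column(0).as_string::<i32>(),
            &StringArray::new_scalar(""),
        )
    });

    // Stand-in for Title LIKE '%Google%' (a substring check rather than the
    // LIKE kernel, to keep the sketch self-contained).
    let title_pred = ArrowPredicateFn::new(ProjectionMask::leaves(schema, [1]), |batch| {
        let title = batch.column(0).as_string::<i32>();
        Ok(title
            .iter()
            .map(|v| v.map(|s| s.contains("Google")))
            .collect::<BooleanArray>())
    });

    // Predicates are evaluated in this order: cheapest/most selective first.
    let predicates: Vec<Box<dyn ArrowPredicate>> =
        vec![Box::new(search_phrase_pred), Box::new(title_pred)];
    RowFilter::new(predicates)
}
```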
run benchmark arrow_reader_clickbench

🤖: Benchmark completed

run benchmark arrow_reader_clickbench
Which issue does this PR close?
Rationale for this change
It can be better to skip pushing down (combined) filters with low selectivity altogether, as they still incur the overhead of many individual, small skip/read operations in the Parquet decoder.
What changes are included in this PR?
This adds a simple threshold to skip the pushdown when the current selection is not "effective", i.e. when it filters out less than a given fraction of the rows.
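A minimal sketch of what such a check could look like on the RowSelection produced by the pushed-down filters. The helper name and the specific rule (selected rows over total rows compared against a threshold) are assumptions for illustration, not necessarily the exact rule or constants in this PR.

```rust
use parquet::arrow::arrow_reader::RowSelection;

/// Only push the filter into the Parquet decoder when the selection is
/// "effective": if it keeps more than `threshold` of the rows, the many small
/// skip/read runs are likely to cost more than they save, so decode everything
/// and filter afterwards instead.
fn should_push_down(selection: &RowSelection, threshold: f64) -> bool {
    let (selected, total) = selection
        .iter()
        .fold((0usize, 0usize), |(selected, total), s| {
            let selected = if s.skip { selected } else { selected + s.row_count };
            (selected, total + s.row_count)
        });
    total > 0 && (selected as f64) < threshold * (total as f64)
}
```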
Are these changes tested?
Are there any user-facing changes?