Skip per-row filter evaluation when all row groups are fully matched #21372
Dandandan wants to merge 1 commit into apache:main
When statistics prove that every remaining row group fully satisfies the filter predicate, skip attaching the row filter to the Parquet decoder entirely. This avoids unnecessary per-row filter evaluation for queries like `WHERE col <> 0` or `WHERE col <> ''` when min/max statistics show the filter is trivially true for all row groups.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
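To illustrate the min/max reasoning in the description, here is a minimal sketch of how statistics could prove that `col <> literal` holds for every row of a row group. `ColumnStats` and `not_equal_fully_matches` are hypothetical names invented for this illustration; they are not the types or functions used by DataFusion or the parquet crate, and the sketch only covers an integer column.

```rust
/// Hypothetical, simplified view of one row group's statistics for a single
/// integer column. This is an assumption for illustration, not a real
/// DataFusion or parquet type.
#[derive(Debug, Clone, Copy)]
pub struct ColumnStats {
    pub min: i64,
    pub max: i64,
    pub null_count: u64,
}

/// Returns true only when the statistics prove that `col <> literal`
/// is true for every row in the row group.
pub fn not_equal_fully_matches(stats: &ColumnStats, literal: i64) -> bool {
    // A NULL value makes `col <> literal` evaluate to NULL, which a WHERE
    // clause treats as "not matched", so any nulls rule out a full match.
    let no_nulls = stats.null_count == 0;
    // If the literal lies strictly outside [min, max], no value can equal it.
    let literal_outside_range = literal < stats.min || literal > stats.max;
    no_nulls && literal_outside_range
}
```

Under these assumptions, a row group with `min = 1`, `max = 500` and no nulls fully matches `col <> 0`, while a row group whose range spans zero, or one that contains nulls, still needs per-row evaluation.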
run benchmarks
🤖 Benchmark running (GKE): comparing skip-filter-fully-matched (ebe07a1) to c17c87c (merge-base) using clickbench_partitioned
🤖 Benchmark running (GKE): comparing skip-filter-fully-matched (ebe07a1) to c17c87c (merge-base) using tpch
🤖 Benchmark running (GKE): comparing skip-filter-fully-matched (ebe07a1) to c17c87c (merge-base) using tpcds
🤖 Benchmark completed (GKE): tpcds, base (merge-base) vs. branch
🤖 Benchmark completed (GKE): clickbench_partitioned, base (merge-base) vs. branch
run benchmarks
🤖 Benchmark running (GKE): comparing skip-filter-fully-matched (ebe07a1) to c17c87c (merge-base) using clickbench_partitioned
🤖 Benchmark running (GKE): comparing skip-filter-fully-matched (ebe07a1) to c17c87c (merge-base) using tpch
Hey @Dandandan, I think the changes should mainly be in the arrow-rs repo. I did some work on this before: xudong963/arrow-rs@eaf4ab1
🤖 Benchmark running (GKE): comparing skip-filter-fully-matched (ebe07a1) to c17c87c (merge-base) using tpcds
Ah yes, I realize we could also do that on the arrow-rs side!
Yes, I'll make a PR soon.
🤖 Benchmark completed (GKE): clickbench_partitioned, base (merge-base) vs. branch
🤖 Benchmark completed (GKE): tpcds, base (merge-base) vs. branch
This makes a lot of sense to me. We have a hacky version of this internally; it's especially effective for filters/queries like `ts > '2026-04-05T00:15:00Z'`, where many files will have all rows match. Can't we do this at the file level right now in DataFusion (in fact, I think we already do)? And once the morsels API is across the line, couldn't we do it at the row group level without any changes in arrow-rs?
Which issue does this PR close?
N/A - Performance optimization
Rationale for this change
What changes are included in this PR?
After all row group pruning (statistics, bloom filters, limit), check if every remaining row group is fully matched by the predicate. If so, drop the per-row filter from the Parquet decoder builder entirely.
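The decision described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `RowGroupOutcome`, `needs_row_filter` and `filter_to_attach` are hypothetical names, and the row filter is modelled as a generic `Option<F>` instead of the real decoder builder.

```rust
/// Hypothetical per-row-group result after statistics, bloom filter and
/// limit pruning have run. Not a real DataFusion type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RowGroupOutcome {
    /// The row group was pruned and will not be read at all.
    Pruned,
    /// Statistics prove that every row satisfies the predicate.
    FullyMatched,
    /// Some rows may fail the predicate, so rows must be checked one by one.
    PartialMatch,
}

/// The per-row filter is only needed if at least one remaining row group
/// is not proven to be fully matched. Pruned row groups are ignored.
pub fn needs_row_filter(outcomes: &[RowGroupOutcome]) -> bool {
    outcomes
        .iter()
        .any(|outcome| *outcome == RowGroupOutcome::PartialMatch)
}

/// Returns the filter to attach to the decoder, or `None` when every
/// remaining row group is fully matched and the filter would be redundant.
pub fn filter_to_attach<F>(row_filter: Option<F>, outcomes: &[RowGroupOutcome]) -> Option<F> {
    if needs_row_filter(outcomes) {
        row_filter
    } else {
        None
    }
}
```

In this sketch a single `PartialMatch` row group keeps the filter for the whole file, which mirrors the all-or-nothing check described in this PR. Skipping the filter per row group is the finer-grained variant discussed in the conversation.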
Are these changes tested?
Existing parquet integration tests pass (198 tests). The optimization is transparent — it produces the same results, just avoids redundant filter evaluation.
Are there any user-facing changes?
No. This is a performance optimization that skips unnecessary work. Query results are unchanged.
🤖 Generated with Claude Code