feat(parquet): skip RowFilter evaluation for fully matched row groups #9694
xudong963 wants to merge 1 commit into apache:main
Conversation
…ully matched row groups When row group statistics prove that all rows in a row group satisfy the filter predicate, the RowFilter evaluation can be skipped entirely for those row groups. This avoids the cost of decoding filter columns and evaluating the predicate expression. Adds `with_fully_matched_row_groups(Vec<usize>)` to ArrowReaderBuilder which flows through to RowGroupReaderBuilder. When processing a fully matched row group, the Start state transitions directly to StartData, bypassing all filter evaluation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When row group statistics prove that ALL rows satisfy the filter predicate, skip both RowFilter evaluation (late materialization) and page index pruning for those row groups. This avoids wasted work decoding filter columns and evaluating predicates that produce no useful filtering. Depends on apache/arrow-rs#9694 for the `with_fully_matched_row_groups()` builder API. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
I've opened a related PR in DataFusion, which currently uses my arrow-rs fork. If we agree the two PRs are headed in the right direction, we can review the arrow-rs PR first; after it's merged and lands in DataFusion, I'll update the DataFusion PR to the released arrow-rs version and we can merge it.
alamb
left a comment
Thank you @xudong963 -- this is a neat idea, and I think not evaluating pushdown filters when DataFusion can prove they won't filter anything out is totally the right way to go.
However, I am not sure this is the right level to do this filtering. I think it might keep the APIs simpler if the push down filter removal happens in DataFusion itself -- for example, DataFusion could make different ParquetPushDecoders for each row group, and the ones where the filters don't filter anything can be disabled
In the context of the "morsel" work -- I think we are heading towards a scan in DataFusion where each row group (or collection of row groups) is a morsel, and then we can treat the morsels individually for IO, pruning, and even moving them around cores.
Any chance you want to help explore that option in DataFusion?
ahah, this is cleaner. I'd like to have a try.
apache/datafusion@d6c3879 makes the refactor on the DataFusion side, which no longer needs the changes from the arrow-rs side.
This is neat. What do you think about the idea of an API to "split" decoders or something 🤔
Which issue does this PR close?
Rationale for this change
When DataFusion evaluates a Parquet scan with filter pushdown, it uses row group statistics to determine which row groups to scan. In many real-world queries, the predicate matches all rows in some (or all) row groups — for example, a time-range filter where entire row groups fall within the range, or a `WHERE status != 'DELETED'` filter on data that contains no deleted rows.

Today, even when row group statistics prove that every row satisfies the predicate, the `RowFilter` is still evaluated row-by-row during decoding. This means the filter columns are decoded and the predicate expression is evaluated for every row — work that produces no useful filtering and can be expensive, especially when filter columns are large (e.g., strings) or the predicate is complex.

This PR adds a mechanism to skip `RowFilter` evaluation entirely for row groups that are known to be "fully matched" based on statistics. The caller (e.g., DataFusion) determines which row groups are fully matched during row group pruning and passes that information to the reader builder. During decoding, fully matched row groups skip straight to data materialization, bypassing filter column decoding and predicate evaluation.

What changes are included in this PR?
- New builder method `with_fully_matched_row_groups(Vec<usize>)` on `ArrowReaderBuilder` — allows callers to specify which row groups have all rows matching the filter predicate.
- Skip filter in `RowGroupReaderBuilder::try_transition()` — when a row group is in the fully-matched set, the `Start` state transitions directly to `StartData`, bypassing the `Filters`/`WaitingOnFilterData` states entirely. The filter is preserved (put back into `self.filter`) for subsequent non-fully-matched row groups.
- Plumbed through all decoder paths — the field is propagated through `ParquetPushDecoderBuilder` and `ParquetRecordBatchStreamBuilder` (async), and ignored in the sync reader (which processes one row group at a time).

Design choices:

- The fully-matched set is stored as a `HashSet<usize>` on `RowGroupReaderBuilder` for O(1) lookup, rather than on `RowGroupDecoderState`, so the state enum size is unchanged (preserving the existing 200-byte size test).
- The builder stores an `Option<Vec<usize>>` at the builder level and converts it to a `HashSet` internally.
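The fast path described above can be sketched as a self-contained state machine. Note this is a simplified illustration with hypothetical types, not the real arrow-rs `RowGroupDecoderState` (which has more states and carries per-state data):

```rust
use std::collections::HashSet;

// Simplified stand-ins for the decoder states discussed above (hypothetical).
#[derive(Debug, PartialEq)]
enum State {
    Start { row_group_idx: usize },
    Filters { row_group_idx: usize },   // filter-evaluation path
    StartData { row_group_idx: usize }, // data-materialization path
}

struct RowGroupReader {
    // Stored on the reader, not in the state enum, for O(1) lookup
    // without growing the state type.
    fully_matched: HashSet<usize>,
    has_filter: bool,
}

impl RowGroupReader {
    fn transition(&self, state: State) -> State {
        match state {
            State::Start { row_group_idx } => {
                // Skip filter evaluation when there is no filter, or when
                // statistics proved every row in this row group matches.
                if !self.has_filter || self.fully_matched.contains(&row_group_idx) {
                    State::StartData { row_group_idx }
                } else {
                    State::Filters { row_group_idx }
                }
            }
            other => other,
        }
    }
}

fn main() {
    let reader = RowGroupReader {
        fully_matched: HashSet::from([0, 2]),
        has_filter: true,
    };
    // Row group 0 is fully matched: jump straight to data decoding.
    assert_eq!(
        reader.transition(State::Start { row_group_idx: 0 }),
        State::StartData { row_group_idx: 0 }
    );
    // Row group 1 is not fully matched: take the filter path.
    assert_eq!(
        reader.transition(State::Start { row_group_idx: 1 }),
        State::Filters { row_group_idx: 1 }
    );
    println!("transitions ok");
}
```

The key property, as in the PR, is that the decision is a cheap set lookup made before any filter column IO or decoding happens.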
The optimization is exercised by an end-to-end benchmark in DataFusion that uses `ParquetPushDecoder` directly (the same code path used by DataFusion's async Parquet opener). The benchmark verifies correctness by asserting the expected row count.

Unit tests can be added if reviewers prefer — happy to add tests that verify:
(I'll open a draft PR on the DataFusion side tomorrow)
Are there any user-facing changes?
Yes — a new public method `ArrowReaderBuilder::with_fully_matched_row_groups()` is added. This is a purely additive, non-breaking change. Existing code is unaffected since the default is `None` (no row groups are marked as fully matched).
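To illustrate the shape of the builder API and the `Option<Vec<usize>>` to `HashSet` conversion described above, here is a self-contained mock (not the actual arrow-rs `ArrowReaderBuilder`; method name taken from this PR, everything else is a stand-in):

```rust
use std::collections::HashSet;

// Mock builder illustrating the described API shape (not the real arrow-rs type).
#[derive(Default)]
struct MockReaderBuilder {
    fully_matched_row_groups: Option<Vec<usize>>,
}

impl MockReaderBuilder {
    // Mirrors the PR's `with_fully_matched_row_groups(Vec<usize>)`:
    // purely additive; the default `None` leaves behavior unchanged.
    fn with_fully_matched_row_groups(mut self, row_groups: Vec<usize>) -> Self {
        self.fully_matched_row_groups = Some(row_groups);
        self
    }

    // At build time the list is converted to a HashSet for O(1) lookups
    // during decoding; `None` becomes the empty set.
    fn build_fully_matched_set(self) -> HashSet<usize> {
        self.fully_matched_row_groups
            .map(|v| v.into_iter().collect())
            .unwrap_or_default()
    }
}

fn main() {
    let set = MockReaderBuilder::default()
        .with_fully_matched_row_groups(vec![0, 2, 2])
        .build_fully_matched_set();
    assert_eq!(set, HashSet::from([0, 2]));

    // Default: no row groups marked, so no filter evaluation is skipped.
    assert!(MockReaderBuilder::default().build_fully_matched_set().is_empty());
    println!("ok");
}
```

A caller such as DataFusion would compute the fully-matched indices during its statistics-based row group pruning and pass them to the real builder in the same chained style.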