feat(parquet): separate push decoder frontier state from row-group decoding by HippoBaro · Pull Request #9804 · apache/arrow-rs

HippoBaro · 2026-04-24T04:34:10Z

Which issue does this PR close?

Prerequisite to feat(parquet): make PushBuffers boundary-agnostic for prefetch IO #9697

Rationale for this change

#9697 aims to make staged buffer management in the push decoder more explicit. In doing so, it exposes a structural problem: the logic for deciding whether a row group is still live, skipped, or unreachable is spread across several parts of the decoder.

This matters because row-group-level buffer release depends on a single question having a clear answer: can this row group ever need bytes again? That answer depends on the queued row groups, the remaining selection, the running offset/limit budget, and whether predicates require the decoder to stay conservative. Today, that state is split across multiple components, which makes the release policy difficult to centralize cleanly.

What changes are included in this PR?

This PR introduces a clearer ownership boundary in the push decoder:

- cross-row-group scan state is now handled by a dedicated frontier/look-ahead mechanism
- the row-group builder is reduced to current-row-group decode work only
- offset/limit accounting and row-group selection advancement are centralized around that frontier/builder split

This does not implement row-group-level buffer release directly, but it establishes the structure needed for that follow-up work. It should also make future pruning rules easier to add and maintain.

Are these changes tested?

All existing tests pass, and the refactor adds focused coverage for the extracted budget logic and the frontier-driven try_next_reader path.

Are there any user-facing changes?

None.

Extract the push decoder offset/limit accounting into `RowBudget` and use it when planning row-group reads. This centralizes the row-count arithmetic needed to apply offset and limit without changing decoder behavior. It also adds focused tests for plain limit, offset+limit, and empty-selection cases so later frontier work can reuse the same accounting safely.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
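To make the offset/limit arithmetic concrete, here is a minimal sketch of what such a budget could look like. It is not the `RowBudget` from this PR: the fields, the `RowPlan` type, and the `consume` and `is_exhausted` methods are hypothetical names chosen for illustration, and the only assumption is that the budget is charged with the number of selected rows of each row group in scan order.

```rust
/// Hypothetical sketch of an offset/limit budget; not the type from the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RowBudget {
    /// Rows that must still be skipped before any row is emitted.
    offset: usize,
    /// Rows that may still be emitted; `None` means no limit.
    limit: Option<usize>,
}

/// How many selected rows of one row group to skip and then read.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RowPlan {
    pub skip: usize,
    pub read: usize,
}

impl RowBudget {
    pub fn new(offset: usize, limit: Option<usize>) -> Self {
        Self { offset, limit }
    }

    /// True once the limit has been fully used: no later row can be emitted.
    pub fn is_exhausted(&self) -> bool {
        self.limit == Some(0)
    }

    /// Charge a row group with `selected_rows` selected rows against the
    /// budget and return how many of those rows to skip and to read.
    pub fn consume(&mut self, selected_rows: usize) -> RowPlan {
        // The offset is paid first, out of the rows this row group offers.
        let skip = self.offset.min(selected_rows);
        self.offset -= skip;

        // Whatever is left after the offset is capped by the remaining limit.
        let available = selected_rows - skip;
        let read = match self.limit {
            Some(limit) => available.min(limit),
            None => available,
        };
        if let Some(limit) = self.limit.as_mut() {
            *limit -= read;
        }
        RowPlan { skip, read }
    }
}
```

With `RowBudget::new(120, Some(30))` and two row groups of 100 selected rows each, the first call to `consume(100)` yields `skip: 100, read: 0` and the second yields `skip: 20, read: 30`, after which `is_exhausted()` is true. A row group with an empty selection leaves the budget untouched.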

Move the cross-row-group scan state into a dedicated `RowGroupFrontier`. The frontier now owns the queued row groups, the tail `RowSelection`, the running `RowBudget`, and the conservative "has predicates" flag. Reduce `RowGroupReaderBuilder` to current-row-group work only by threading a budget snapshot into `next_row_group` and returning a typed `RowGroupBuildResult`. This also folds in the selection-frontier cleanup so queued selection state is consumed in one place instead of through ad hoc split/clone logic.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
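The "consumed in one place" idea can be illustrated with a small self-contained model. Everything below is a hypothetical stand-in rather than the PR's code: `Run` plays the role of a selector run, `Frontier` models only the queue and the tail selection (the budget and the predicate flag are left out), and `pop_next` is an invented name. The sketch assumes the tail selection is expressed in rows relative to the first queued row group.

```rust
use std::collections::VecDeque;

/// A run of consecutive rows that is either selected or skipped
/// (hypothetical stand-in for a row selector).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Run {
    pub rows: usize,
    pub skip: bool,
}

/// What the frontier hands to the per-row-group builder.
pub struct NextRowGroup {
    pub index: usize,
    /// Selection restricted to this row group; `None` selects every row.
    pub selection: Option<Vec<Run>>,
}

/// Hypothetical model of the cross-row-group state.
pub struct Frontier {
    /// (row group index, total rows in that row group), in scan order.
    queued: VecDeque<(usize, usize)>,
    /// Selection covering all queued row groups; `None` selects everything.
    tail_selection: Option<VecDeque<Run>>,
}

impl Frontier {
    pub fn new(queued: Vec<(usize, usize)>, selection: Option<Vec<Run>>) -> Self {
        Self {
            queued: queued.into(),
            tail_selection: selection.map(VecDeque::from),
        }
    }

    /// Pop the next row group and carve its slice off the tail selection.
    /// This is the only place where the tail selection is advanced.
    pub fn pop_next(&mut self) -> Option<NextRowGroup> {
        let (index, total_rows) = self.queued.pop_front()?;
        let selection = self
            .tail_selection
            .as_mut()
            .map(|tail| split_off(tail, total_rows));
        Some(NextRowGroup { index, selection })
    }
}

/// Remove the first `rows` rows from `tail` and return them as their own runs.
fn split_off(tail: &mut VecDeque<Run>, mut rows: usize) -> Vec<Run> {
    let mut head = Vec::new();
    while rows > 0 {
        let Some(mut run) = tail.pop_front() else { break };
        if run.rows > rows {
            // The run straddles the row group boundary: keep the remainder queued.
            tail.push_front(Run { rows: run.rows - rows, skip: run.skip });
            run.rows = rows;
        }
        rows -= run.rows;
        head.push(run);
    }
    head
}
```

Because the builder only ever receives the already-split slice, it has no reason to clone or re-split the selection itself, which is the property the commit message describes.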

Teach the row-group frontier to seek ahead over queued row groups that can be proven unreachable before instantiating the row-group builder. Skip queued row groups when their selection slice is empty, when offset/limit leaves no rows to read, or when the remaining limit is already exhausted. Keep predicate-bearing row groups conservative and stop at the first row group that may still need data. Add a push decoder regression covering `try_next_reader` with offset/limit so the frontier path is exercised directly.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
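The skip rules listed in this commit message can be sketched as a single look-ahead loop. This is an illustrative model, not the PR's implementation: `QueuedRowGroup`, `Frontier`, `seek_ahead`, and the accessor names are hypothetical. The commit message does not spell out exactly how predicates interact with each rule, so the sketch assumes that an exhausted limit and an empty selection are safe to act on even with predicates, while the offset rule is disabled because the number of rows surviving a filter is not known in advance.

```rust
use std::collections::VecDeque;

/// A queued row group and the number of rows its selection slice selects
/// (hypothetical stand-in; the real frontier holds richer state).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct QueuedRowGroup {
    pub index: usize,
    pub selected_rows: usize,
}

pub struct Frontier {
    queued: VecDeque<QueuedRowGroup>,
    /// Rows still to skip before any output.
    offset: usize,
    /// Rows still allowed to be emitted; `None` means no limit.
    limit: Option<usize>,
    /// When true, row counts after filtering are unknown.
    has_predicates: bool,
}

impl Frontier {
    pub fn new(
        queued: Vec<QueuedRowGroup>,
        offset: usize,
        limit: Option<usize>,
        has_predicates: bool,
    ) -> Self {
        Self { queued: queued.into(), offset, limit, has_predicates }
    }

    pub fn remaining_offset(&self) -> usize {
        self.offset
    }

    pub fn next_index(&self) -> Option<usize> {
        self.queued.front().map(|rg| rg.index)
    }

    /// Drop leading row groups that provably never need bytes and return
    /// their indices. Stops at the first row group that may still need data.
    pub fn seek_ahead(&mut self) -> Vec<usize> {
        let mut skipped = Vec::new();
        while let Some(front) = self.queued.front().copied() {
            let unreachable = if self.limit == Some(0) {
                // Limit already exhausted: nothing later can be emitted.
                true
            } else if front.selected_rows == 0 {
                // Empty selection slice: no row of this group is wanted.
                true
            } else if self.has_predicates {
                // Assumption: stay conservative, the filter decides the counts.
                false
            } else if front.selected_rows <= self.offset {
                // The offset swallows the whole row group.
                self.offset -= front.selected_rows;
                true
            } else {
                false
            };
            if !unreachable {
                break;
            }
            self.queued.pop_front();
            skipped.push(front.index);
        }
        skipped
    }
}
```

For example, with an offset of 150, a limit of 10, no predicates, and queued row groups selecting 100, 0, 100, and 100 rows, `seek_ahead` skips the first two, leaves an offset of 50, and stops at the third. With predicates enabled and the same inputs, nothing is skipped. Charging the budget for rows that are actually read is left out of this sketch.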

github-actions bot added the parquet (Changes to the parquet crate) label Apr 24, 2026

HippoBaro added 2 commits April 24, 2026 01:13

HippoBaro force-pushed the frontier_row_group_selection branch from 600500d to 307e4d8 Compare April 24, 2026 05:17

HippoBaro mentioned this pull request Apr 24, 2026

feat(parquet): make PushBuffers boundary-agnostic for prefetch IO #9697

Draft

HippoBaro wants to merge 3 commits into apache:main from HippoBaro:frontier_row_group_selection
