Skip to content

Allow swapping projections / row filters at row group boundaries in Parquet reader#9968

Open
adriangb wants to merge 3 commits into
apache:mainfrom
pydantic:adaptive-strategy-swap
Open

Allow swapping projections / row filters at row group boundaries in Parquet reader#9968
adriangb wants to merge 3 commits into
apache:mainfrom
pydantic:adaptive-strategy-swap

Conversation

@adriangb
Copy link
Copy Markdown
Contributor

This is the decoder piece of the work presented at the NYC DataFusion meetup.

The idea is that we'll be able to adaptively promote and demote filters into row filters based on runtime selectivity stats.

@github-actions github-actions Bot added the parquet Changes to the parquet crate label May 13, 2026
@adriangb adriangb force-pushed the adaptive-strategy-swap branch from 403183a to 48d1657 Compare May 13, 2026 05:10
adriangb and others added 2 commits May 13, 2026 01:17
Add a small surface to the push decoder so callers can swap the
RowFilter, ProjectionMask, and/or RowSelectionPolicy at row-group
boundaries without rebuilding the decoder:

- new pub fn `can_swap_strategy() -> bool` — true between row groups
  (outer state `ReadingRowGroup`, inner state `Finished`)
- new pub fn `swap_strategy(StrategySwap) -> Result<()>` — rejected
  with `ParquetError::General` when called mid-row-group
- new `pub struct StrategySwap` (`#[non_exhaustive]`) with builder
  methods `with_projection`, `with_filter`, `with_row_selection_policy`
- new pub fn `row_groups_remaining() -> usize` for diagnostics

`PushBuffers` carries through the swap, so bytes already fetched for
columns that survive into the new strategy are reused — only bytes the
new strategy needs but that aren't already buffered get requested via
`NeedsData`.

Adaptive callers should drive the decoder with `try_next_reader`
rather than `try_decode`: handing the active reader off transitions
the decoder back to `ReadingRowGroup` immediately, giving the caller
a clean swap window between two consecutive returns. `try_decode`
loops past row-group boundaries internally and is unsuitable for
in-flight strategy changes.

Tests cover: filter swap between row groups, mid-row-group rejection,
swap-while-iterating-handed-off-reader, and projection narrowing
that reuses already-buffered bytes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`PushBuffers` is `pub(crate)` in the arrow-rs workspace, so the
intra-doc link `[\`PushBuffers\`]: crate::util::push_buffers::PushBuffers`
in `ParquetPushDecoder::swap_strategy`'s public docs failed the
`-D rustdoc::private-intra-doc-links` check on `cargo +nightly doc
--document-private-items`. The rest of the doc is unchanged; the
prose now refers to the type by name without a link.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@adriangb adriangb force-pushed the adaptive-strategy-swap branch from 48d1657 to fee4744 Compare May 13, 2026 05:17
@adriangb adriangb marked this pull request as ready for review May 13, 2026 05:21
@adriangb
Copy link
Copy Markdown
Contributor Author

@HippoBaro @alamb mind taking a look at this?

Add three swap_strategy tests:

- expands_projection_between_row_groups: narrow -> wide swap with the
  whole file prefetched; verifies RG1 decodes the widened projection
  using already-buffered bytes.
- expand_projection_requests_new_bytes: narrow -> wide swap driven
  incrementally; pins the post-swap NeedsData to the three RG1 column
  chunks (11168 bytes total), confirming the decoder fetches bytes
  for the newly-added columns.
- narrow_projection_skips_unneeded_bytes: wide -> narrow swap driven
  incrementally; pins the post-swap NeedsData to the single RG1
  column-a chunk (1856 bytes), confirming the decoder skips bytes
  for columns dropped from the projection.

The incremental tests compare RG1-with-projection-X against
RG1-with-projection-Y (not RG0 vs RG1), so the asymmetry isolates
the projection change rather than per-row-group size variation
in the StringView column "c". Expected ranges are hardcoded since
TEST_BATCH and the writer settings are static.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@HippoBaro
Copy link
Copy Markdown
Contributor

I am OOO until next Wed but will try to find some time to look at this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants