Allow swapping projections / row filters at row group boundaries in Parquet reader#9968
Open
adriangb wants to merge 3 commits into
Open
Allow swapping projections / row filters at row group boundaries in Parquet reader#9968adriangb wants to merge 3 commits into
adriangb wants to merge 3 commits into
Conversation
403183a to
48d1657
Compare
Add a small surface to the push decoder so callers can swap the RowFilter, ProjectionMask, and/or RowSelectionPolicy at row-group boundaries without rebuilding the decoder: - new pub fn `can_swap_strategy() -> bool` — true between row groups (outer state `ReadingRowGroup`, inner state `Finished`) - new pub fn `swap_strategy(StrategySwap) -> Result<()>` — rejected with `ParquetError::General` when called mid-row-group - new `pub struct StrategySwap` (`#[non_exhaustive]`) with builder methods `with_projection`, `with_filter`, `with_row_selection_policy` - new pub fn `row_groups_remaining() -> usize` for diagnostics `PushBuffers` carries through the swap, so bytes already fetched for columns that survive into the new strategy are reused — only bytes the new strategy needs but that aren't already buffered get requested via `NeedsData`. Adaptive callers should drive the decoder with `try_next_reader` rather than `try_decode`: handing the active reader off transitions the decoder back to `ReadingRowGroup` immediately, giving the caller a clean swap window between two consecutive returns. `try_decode` loops past row-group boundaries internally and is unsuitable for in-flight strategy changes. Tests cover: filter swap between row groups, mid-row-group rejection, swap-while-iterating-handed-off-reader, and projection narrowing that reuses already-buffered bytes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`PushBuffers` is `pub(crate)` in the arrow-rs workspace, so the intra-doc link `[\`PushBuffers\`]: crate::util::push_buffers::PushBuffers` in `ParquetPushDecoder::swap_strategy`'s public docs failed the `-D rustdoc::private-intra-doc-links` check on `cargo +nightly doc --document-private-items`. The rest of the doc is unchanged; the prose now refers to the type by name without a link. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
48d1657 to
fee4744
Compare
Contributor
Author
|
@HippoBaro @alamb mind taking a look at this? |
Add three swap_strategy tests: - expands_projection_between_row_groups: narrow -> wide swap with the whole file prefetched; verifies RG1 decodes the widened projection using already-buffered bytes. - expand_projection_requests_new_bytes: narrow -> wide swap driven incrementally; pins the post-swap NeedsData to the three RG1 column chunks (11168 bytes total), confirming the decoder fetches bytes for the newly-added columns. - narrow_projection_skips_unneeded_bytes: wide -> narrow swap driven incrementally; pins the post-swap NeedsData to the single RG1 column-a chunk (1856 bytes), confirming the decoder skips bytes for columns dropped from the projection. The incremental tests compare RG1-with-projection-X against RG1-with-projection-Y (not RG0 vs RG1), so the asymmetry isolates the projection change rather than per-row-group size variation in the StringView column "c". Expected ranges are hardcoded since TEST_BATCH and the writer settings are static. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
752de6e to
b75d300
Compare
Contributor
|
I am OOO until next Wed but will try to find some time to look at this! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This is the decoder piece of the work presented at the NYC DataFusion meetup.
The idea is that we'll be able to adaptively promote and demote filters into row filters based on runtime selectivity stats.