-
Notifications
You must be signed in to change notification settings - Fork 1.9k
fix: Reverse row selection should respect the row group index #19557
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes a bug where the reverse_row_selection function didn't correctly handle cases where not all row groups in a Parquet file are being scanned. The fix adds a new parameter row_groups_to_scan to map the row selection to the specific row groups being accessed.
Key changes:
- Modified
reverse_row_selectionto accept arow_groups_to_scanparameter that specifies which row groups are actually being scanned - Updated the algorithm to build row ranges only for scanned row groups rather than all row groups in the file
- Added comprehensive documentation explaining the relationship between row selection and row groups to scan
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
datafusion/datasource-parquet/src/sort.rs |
Updated reverse_row_selection function signature to include row_groups_to_scan parameter, modified algorithm to use only scanned row groups, enhanced documentation, updated all test cases, and added new test for the bug fix scenario |
datafusion/datasource-parquet/src/opener.rs |
Updated call to reverse_row_selection to pass the original (non-reversed) row group indexes |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| let mut rg_row_ranges: Vec<(usize, usize, usize)> = | ||
| Vec::with_capacity(rg_metadata.len()); | ||
| Vec::with_capacity(row_groups_to_scan.len()); | ||
| let mut current_row = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the tracking for current_row still correct? We used to scan every row so I think it was correct, but now we if we skip a row group won't current_row be incorrect globally?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes @adriangb , it's still correct! current_row now tracks relative positions within scanned row groups, not global file positions. This matches how RowSelection works - it only covers rows from the row groups being scanned, with no knowledge of skipped groups. Added clarifying comments
| RowSelection::from(vec![RowSelector::select(50), RowSelector::skip(250)]); | ||
|
|
||
| let reversed = reverse_row_selection(&selection, &metadata).unwrap(); | ||
| let row_groups_to_scan = vec![0, 1, 2]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't row_groups_to_scan derived from the RowSelection? So in this case it should be vec![0]?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added more comments now, and used reverse_access_plan in latest PR, it should be more clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My point was more so that we should be testing closer to the production code path, which I think involves building a PreparedAccessPlan and calling PreparedAccessPlan::reverse
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, i got it now @adriangb , i will try to address it, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in latest PR.
| row_selection: Option<parquet::arrow::arrow_reader::RowSelection>, | ||
| } | ||
|
|
||
| impl PreparedAccessPlan { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should be writing tests as:
/// Test helper
fn reverse_access_plan(plan: ParquetAccessPlan, metadata: &ParquetMetaData) -> RowSelection {...}
/// Test code
let plan = ParquetAccessPlan...
let expected = RowSelection ...
let reversed = reverse_access_plan(plan, metadata)
assert_eq!(expected, reversed)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great point @adriangb , this looks better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in latest commit now @adriangb , thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in latest PR.
| self.row_selection = Some(reverse_row_selection( | ||
| &row_selection, | ||
| file_metadata, | ||
| &row_groups_to_scan, // Pass the original (non-reversed) row group indexes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adriangb The main change is to pass row_group_indexes to reverse logic instead of using all indexes.
Because i saw the following logic:
| pub fn into_overall_row_selection( |
|
Thank you @adriangb for review, i have addressed the comments and used 100% real cases to test now which building a PreparedAccessPlan and calling PreparedAccessPlan::reverse for all tests in sort.rs. |
| // If no selection originally, after reversal should still select all rows | ||
| if let Some(selection) = reversed_plan.row_selection { | ||
| let total_selected: usize = selection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this conditional necessary? Don't we know if there is/isn't a row selection since we just created it? I.e. we can .unwrap()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch @adriangb , this case the selection should be None due to select all rows, so i changed the test case to assert None in latest commit, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated:
#[test]
fn test_prepared_access_plan_reverse_simple() {
// Test: all row groups are scanned, no row selection
let metadata = create_test_metadata(vec![100, 100, 100]);
let access_plan = ParquetAccessPlan::new_all(3);
let rg_metadata = metadata.row_groups();
let prepared_plan =
PreparedAccessPlan::from_access_plan(access_plan, rg_metadata)
.expect("Failed to create PreparedAccessPlan");
// Verify original plan
assert_eq!(prepared_plan.row_group_indexes, vec![0, 1, 2]);
// No row selection originally due to scanning all rows
assert_eq!(prepared_plan.row_selection, None);
let reversed_plan = prepared_plan
.reverse(&metadata)
.expect("Failed to reverse PreparedAccessPlan");
// Verify row groups are reversed
assert_eq!(reversed_plan.row_group_indexes, vec![2, 1, 0]);
// If no selection originally, after reversal should still select all rows,
// and the selection should be None
assert_eq!(reversed_plan.row_selection, None);
}
Which issue does this PR close?
Rationale for this change
Reverse row selection should respect the row group index, this PR will fix the issue.
What changes are included in this PR?
Reverse row selection should respect the row group index, this PR will fix the issue.
Are these changes tested?
Yes
Are there any user-facing changes?
No