Skip to content

Conversation

@zhuqi-lucas
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

Reverse row selection should respect the row group index, this PR will fix the issue.

What changes are included in this PR?

Reverse row selection should respect the row group index, this PR will fix the issue.

Are these changes tested?

Yes

Are there any user-facing changes?

No

@zhuqi-lucas zhuqi-lucas requested review from alamb and Copilot December 30, 2025 05:19
@github-actions github-actions bot added the datasource Changes to the datasource crate label Dec 30, 2025
@zhuqi-lucas zhuqi-lucas requested a review from adriangb December 30, 2025 05:20
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a bug where the reverse_row_selection function didn't correctly handle cases where not all row groups in a Parquet file are being scanned. The fix adds a new parameter row_groups_to_scan to map the row selection to the specific row groups being accessed.

Key changes:

  • Modified reverse_row_selection to accept a row_groups_to_scan parameter that specifies which row groups are actually being scanned
  • Updated the algorithm to build row ranges only for scanned row groups rather than all row groups in the file
  • Added comprehensive documentation explaining the relationship between row selection and row groups to scan

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
datafusion/datasource-parquet/src/sort.rs Updated reverse_row_selection function signature to include row_groups_to_scan parameter, modified algorithm to use only scanned row groups, enhanced documentation, updated all test cases, and added new test for the bug fix scenario
datafusion/datasource-parquet/src/opener.rs Updated call to reverse_row_selection to pass the original (non-reversed) row group indexes

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

let mut rg_row_ranges: Vec<(usize, usize, usize)> =
Vec::with_capacity(rg_metadata.len());
Vec::with_capacity(row_groups_to_scan.len());
let mut current_row = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the tracking for current_row still correct? We used to scan every row so I think it was correct, but now we if we skip a row group won't current_row be incorrect globally?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes @adriangb , it's still correct! current_row now tracks relative positions within scanned row groups, not global file positions. This matches how RowSelection works - it only covers rows from the row groups being scanned, with no knowledge of skipped groups. Added clarifying comments

RowSelection::from(vec![RowSelector::select(50), RowSelector::skip(250)]);

let reversed = reverse_row_selection(&selection, &metadata).unwrap();
let row_groups_to_scan = vec![0, 1, 2];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't row_groups_to_scan derived from the RowSelection? So in this case it should be vec![0]?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added more comments now, and used reverse_access_plan in latest PR, it should be more clear.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point was more so that we should be testing closer to the production code path, which I think involves building a PreparedAccessPlan and calling PreparedAccessPlan::reverse

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, i got it now @adriangb , i will try to address it, thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in latest PR.

row_selection: Option<parquet::arrow::arrow_reader::RowSelection>,
}

impl PreparedAccessPlan {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should be writing tests as:

/// Test helper
fn reverse_access_plan(plan: ParquetAccessPlan, metadata: &ParquetMetaData) -> RowSelection {...}

/// Test code
let plan = ParquetAccessPlan...
let expected = RowSelection ...
let reversed = reverse_access_plan(plan, metadata)
assert_eq!(expected, reversed)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point @adriangb , this looks better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in latest commit now @adriangb , thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in latest PR.

self.row_selection = Some(reverse_row_selection(
&row_selection,
file_metadata,
&row_groups_to_scan, // Pass the original (non-reversed) row group indexes
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adriangb The main change is to pass row_group_indexes to reverse logic instead of using all indexes.

Because i saw the following logic:

pub fn into_overall_row_selection(

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Dec 30, 2025
@github-actions github-actions bot removed the sqllogictest SQL Logic Tests (.slt) label Dec 30, 2025
@zhuqi-lucas
Copy link
Contributor Author

Thank you @adriangb for review, i have addressed the comments and used 100% real cases to test now which building a PreparedAccessPlan and calling PreparedAccessPlan::reverse for all tests in sort.rs.

Comment on lines 186 to 188
// If no selection originally, after reversal should still select all rows
if let Some(selection) = reversed_plan.row_selection {
let total_selected: usize = selection
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this conditional necessary? Don't we know if there is/isn't a row selection since we just created it? I.e. we can .unwrap()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch @adriangb , this case the selection should be None due to select all rows, so i changed the test case to assert None in latest commit, thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated:

 #[test]
    fn test_prepared_access_plan_reverse_simple() {
        // Test: all row groups are scanned, no row selection
        let metadata = create_test_metadata(vec![100, 100, 100]);

        let access_plan = ParquetAccessPlan::new_all(3);
        let rg_metadata = metadata.row_groups();

        let prepared_plan =
            PreparedAccessPlan::from_access_plan(access_plan, rg_metadata)
                .expect("Failed to create PreparedAccessPlan");

        // Verify original plan
        assert_eq!(prepared_plan.row_group_indexes, vec![0, 1, 2]);

        // No row selection originally due to scanning all rows
        assert_eq!(prepared_plan.row_selection, None);

        let reversed_plan = prepared_plan
            .reverse(&metadata)
            .expect("Failed to reverse PreparedAccessPlan");

        // Verify row groups are reversed
        assert_eq!(reversed_plan.row_group_indexes, vec![2, 1, 0]);

        // If no selection originally, after reversal should still select all rows,
        // and the selection should be None
        assert_eq!(reversed_plan.row_selection, None);
    }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

datasource Changes to the datasource crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Reverse row selection should respect the row group index

2 participants