Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prune pages are all null in ParquetExec by row_counts and fix NOT NULL prune #10051

Merged
merged 10 commits into from
Apr 14, 2024

Conversation

Ted-Jiang
Copy link
Member

@Ted-Jiang Ted-Jiang commented Apr 12, 2024

and fix NOT NULL prune

Which issue does this PR close?

Closes #9961
Closes #10049.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@Ted-Jiang Ted-Jiang requested a review from alamb April 12, 2024 07:13
@github-actions github-actions bot added the core Core datafusion crate label Apr 12, 2024
None
// see https://github.com/apache/arrow-rs/blob/91f0b1771308609ca27db0fb1d2d49571b3980d8/parquet/src/file/metadata.rs#L979-L982
let mut first_row_index = self
.col_offset_indexes
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can do this same calculation without allocating intermediate vec s-- something like this:

        let row_count_per_page = self
            .col_offset_indexes
            .windows(2)
            .map(|location| Some(location[1].first_row_index - location[0].first_row_index));

        Some(Arc::new(Int64Array::from_iter(row_count_per_page)))

🤔 the name col_offset_indexes is somewhat confusing to me as they are PageLocations --

    col_offset_indexes: &'a Vec<PageLocation>,

maybe we could rename that field to page_locations 🤔

Copy link
Member Author

@Ted-Jiang Ted-Jiang Apr 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb thanks to your always conscientious review and kindly suggestions . 👍

Avoid allocate in b2d3641

In my mind page_locations store the offsets in the column chunk , i think they are the same meaning 🤣

@Ted-Jiang Ted-Jiang changed the title Prune pages are all null in ParquetExec by row_counts Prune pages are all null in ParquetExec by row_counts and fix NOT NULL prune Apr 12, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Ted-Jiang -- this is looking good. I think it could be merged as is, though I also have several suggestions that I think will make the code and comments better. We can do so as a follow on PR though.

datafusion/core/src/physical_optimizer/pruning.rs Outdated Show resolved Hide resolved
datafusion/core/src/physical_optimizer/pruning.rs Outdated Show resolved Hide resolved
@@ -824,9 +863,17 @@ fn create_data_batch(scenario: Scenario) -> Vec<RecordBatch> {
}
Scenario::WithNullValues => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please also add the "some null/some not null" null values case to this scenario as well?

                make_int_batches_with_null(5, 1, 6),

}
Scenario::WithNullValuesPageLevel => {
vec![
make_int_batches_with_null(5, 1, 6),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this scenario should have at least one batch of all nulls (e.g. make_int_batches_with_nulls(5, 0, 0))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this will be prune by row group -> column level

Comment on lines +640 to +641
no_null_values_start: usize,
no_null_values_end: usize,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you might be able to simplify this method by sending in 2 parameters. Right now it looks like it interleaves nulls arbitrarily but really the nulls are always at the start and non nulls are at the end

  num_nulls: usize,
  non_nulls: usize

// row_group4: page1(1~5), page2(All Null)
// total 40 rows
async fn test_pages_with_null_values() {
test_prune(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a while to convince myself that this was actually setting up the scenario as described. I eventually found it here:

https://github.com/alamb/arrow-datafusion/blob/dee926519030301f052dc2c3196e4fbef0da4c47/datafusion/core/tests/parquet/mod.rs#L916-L920

I wonder if it would be possible to add some better comments to test_prune mentioning that the created parquet files have 5 rows per page

Copy link
Member Author

@Ted-Jiang Ted-Jiang Apr 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i will make another pr to improve this tests

datafusion/core/tests/parquet/page_pruning.rs Outdated Show resolved Hide resolved
datafusion/core/tests/parquet/page_pruning.rs Outdated Show resolved Hide resolved
datafusion/core/tests/parquet/page_pruning.rs Outdated Show resolved Hide resolved
datafusion/core/tests/parquet/page_pruning.rs Outdated Show resolved Hide resolved
Ted-Jiang and others added 8 commits April 13, 2024 18:34
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@Ted-Jiang Ted-Jiang merged commit 671cef8 into apache:main Apr 14, 2024
24 checks passed
@alamb
Copy link
Contributor

alamb commented Apr 15, 2024

Thanks again @Ted-Jiang -- I took another look through this PR and it is very nice 👌

appletreeisyellow pushed a commit to influxdata/arrow-datafusion that referenced this pull request Apr 24, 2024
…L prune (apache#10051)

* Prune pages are all null in ParquetExec by row_counts
and fix NOT NULL prune

* fix clippy

* Update datafusion/core/src/physical_optimizer/pruning.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Update datafusion/core/tests/parquet/page_pruning.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Update datafusion/core/tests/parquet/page_pruning.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Update datafusion/core/tests/parquet/page_pruning.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* Update datafusion/core/tests/parquet/page_pruning.rs

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

* remove allocate vec

* better way avoid allocate vec

* simply expr

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core datafusion crate
Projects
None yet
2 participants