Add End-to-end test for parquet pruning + metrics for ParquetExec #657
use async_trait::async_trait;
use futures::stream::{Stream, StreamExt};

use super::SQLMetric;
The changes to this file add metrics on the pruning (which are then used in the test).
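The metric pattern in play here can be sketched with a std-only stand-in. This is not DataFusion's actual `SQLMetric` API, just an assumed shape: a shared atomic counter that the scan bumps as it prunes and that can be read after execution.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for DataFusion's SQLMetric counter (names and
/// shape are assumptions, not the real API): a shared atomic that
/// operators increment during execution and that is read afterwards.
#[derive(Debug)]
pub struct SQLMetric {
    value: AtomicUsize,
}

impl SQLMetric {
    pub fn counter() -> Arc<Self> {
        Arc::new(SQLMetric { value: AtomicUsize::new(0) })
    }
    pub fn add(&self, n: usize) {
        self.value.fetch_add(n, Ordering::Relaxed);
    }
    pub fn value(&self) -> usize {
        self.value.load(Ordering::Relaxed)
    }
}

fn main() {
    let pruned = SQLMetric::counter();
    // Simulate a scan that prunes three row groups.
    pruned.add(3);
    println!("row groups pruned: {}", pruned.value()); // prints "row groups pruned: 3"
}
```

The `Arc` is what lets the test hold a handle to the same counter the operator is updating.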
Field::new("millis", arr_millis.data_type().clone(), false),
Field::new("secs", arr_secs.data_type().clone(), false),
Field::new("name", arr_names.data_type().clone(), false),
Field::new("nanos", arr_nanos.data_type().clone(), true),
This was a bug I found while using similar code for the end-to-end test: the actual data contains None, so the schema needs to be marked "nullable".
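The mismatch can be illustrated with a toy schema-vs-data check (this is not the arrow-rs API; `validate_column` and the field names are hypothetical): a column declared non-nullable must reject data containing None.

```rust
/// Toy consistency check mirroring why the `nanos` field above had to be
/// declared nullable. Purely illustrative; not arrow-rs code.
fn validate_column(name: &str, nullable: bool, values: &[Option<i64>]) -> Result<(), String> {
    if !nullable && values.iter().any(|v| v.is_none()) {
        return Err(format!(
            "column '{}' declared non-nullable but contains None",
            name
        ));
    }
    Ok(())
}

fn main() {
    let nanos = vec![Some(1), None, Some(3)];
    // Declared non-nullable: a writer would reject this data.
    assert!(validate_column("nanos", false, &nanos).is_err());
    // Declared nullable (the fix in this diff): accepted.
    assert!(validate_column("nanos", true, &nanos).is_ok());
    println!("ok");
}
```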
use parquet::{arrow::ArrowWriter, file::properties::WriterProperties};
use tempfile::NamedTempFile;

#[tokio::test]
Here are the new end-to-end tests. They create an actual parquet file and then run a query against it, validating the pruning metrics. They currently "pass", but they show that no actual pruning is occurring.
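The shape of that check can be sketched std-only (the real test writes a parquet file with `ArrowWriter` and runs SQL through DataFusion; everything here, including `RowGroup` and `scan`, is a hypothetical simplification): prune row groups against min/max statistics, then assert on how many were read versus pruned.

```rust
/// Toy row-group statistics, standing in for parquet min/max metadata.
#[allow(dead_code)]
struct RowGroup {
    min: i64,
    max: i64,
}

/// "Scan" the groups, skipping those the predicate rules out.
/// Returns (row groups read, row groups pruned).
fn scan(groups: &[RowGroup], pred: impl Fn(&RowGroup) -> bool) -> (usize, usize) {
    let read = groups.iter().filter(|g| pred(g)).count();
    (read, groups.len() - read)
}

fn main() {
    let groups = vec![
        RowGroup { min: 0, max: 9 },
        RowGroup { min: 10, max: 19 },
        RowGroup { min: 20, max: 29 },
    ];
    // Query equivalent to `WHERE value < 10`: only the first group can match,
    // since a group whose min is >= 10 cannot contain a qualifying row.
    let (read, pruned) = scan(&groups, |g| g.min < 10);
    assert_eq!((read, pruned), (1, 2));
    println!("read={} pruned={}", read, pruned); // prints "read=1 pruned=2"
}
```

A broken pruner is one where `pruned` stays at 0 even for such a selective predicate, which is exactly what these tests expose.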
FYI @houqp and @yordan-pavlov
/// Statistics for the data set (sum of statistics for all partitions)
statistics: Statistics,
/// Number of times the pruning predicate could not be created
predicate_creation_errors: Arc<SQLMetric>,
Would it make sense to move predicate_creation_errors into ParquetMetrics, next to the other two metrics?
The reason I didn't do this is that the other metrics are per ParquetPartition (i.e. per file or set of files), while this one is a single metric for the overall ParquetExec.
I agree this is confusing -- I'll make two statistics structures (ParquetPartitionMetrics and ParquetMetrics), which hopefully will make this clearer.
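The split described above might look roughly like this. The field names are assumptions and the atomics stand in for `SQLMetric`; the point is only the scoping: one struct instantiated per partition, one for the whole plan.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Per-partition metrics: one instance per ParquetPartition
/// (i.e. per file or set of files). Field names are hypothetical.
pub struct ParquetPartitionMetrics {
    pub num_predicate_evaluation_errors: Arc<AtomicUsize>,
    pub num_row_groups_pruned: Arc<AtomicUsize>,
}

impl ParquetPartitionMetrics {
    pub fn new() -> Self {
        Self {
            num_predicate_evaluation_errors: Arc::new(AtomicUsize::new(0)),
            num_row_groups_pruned: Arc::new(AtomicUsize::new(0)),
        }
    }
}

/// Plan-wide metrics: a single instance for the whole ParquetExec.
pub struct ParquetMetrics {
    pub predicate_creation_errors: Arc<AtomicUsize>,
}

fn main() {
    // One per partition, one for the plan.
    let part = ParquetPartitionMetrics::new();
    let plan = ParquetMetrics {
        predicate_creation_errors: Arc::new(AtomicUsize::new(0)),
    };
    part.num_row_groups_pruned.fetch_add(2, Ordering::Relaxed);
    println!(
        "pruned={} creation_errors={}",
        part.num_row_groups_pruned.load(Ordering::Relaxed),
        plan.predicate_creation_errors.load(Ordering::Relaxed)
    ); // prints "pruned=2 creation_errors=0"
}
```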
in 595fcdd07b
match predicate_values {
    Ok(values) => Box::new(move |_, i| {
        // NB: false means don't scan row group
        if !values[i] {
Should the counting of filtered row groups happen in the actual filter function? What is the benefit of counting inside the filter function? Why not move it outside, just before the filter function is returned (similar to the error case below)?
One difference that results from updating the metrics inside the filter function is what happens with a limit query, à la SELECT * from parquet_table where date < 2012-20-20 limit 10 -- in that case the query might never even contemplate reading a particular partition, due to the limit.
However, having written that, I think reporting the stats prior to the actual function would make the statistics easier to understand (and more consistent from query to query). I will make that change.
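The "count up front" placement being agreed on here can be sketched std-only (the signature and `build_filter` name are hypothetical; DataFusion's real closure also takes row-group metadata): the metric is bumped while building the closure, so it reflects all pruned row groups even if a LIMIT means the closure is never invoked for some of them.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Build a row-group filter from precomputed predicate results,
/// recording the pruned count before the closure is ever called.
fn build_filter(
    predicate_values: Vec<bool>,
    pruned: Arc<AtomicUsize>,
) -> Box<dyn Fn(usize) -> bool> {
    // Count up front: every `false` entry is a row group that will be skipped.
    let filtered = predicate_values.iter().filter(|keep| !**keep).count();
    pruned.fetch_add(filtered, Ordering::Relaxed);
    // NB: false means don't scan row group.
    Box::new(move |i| predicate_values[i])
}

fn main() {
    let pruned = Arc::new(AtomicUsize::new(0));
    let filter = build_filter(vec![true, false, false, true], Arc::clone(&pruned));
    // Even if a LIMIT stops the scan after the first row group,
    // the metric already reflects both pruned row groups.
    assert!(filter(0));
    println!("pruned: {}", pruned.load(Ordering::Relaxed)); // prints "pruned: 2"
}
```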
Looks very interesting @alamb. I wonder how else these metrics could be used; they could be useful for general diagnostics and troubleshooting query performance.
Indeed @yordan-pavlov - in fact I think @andygrove is doing just that with PRs such as #662 👍
159dde8 to 37d97f7
@Dandandan / @yordan-pavlov / @houqp, any concerns about merging this one in? I now have the fix PRs backed up behind this test PR, so I would like to get the test in so I can get the fixes up for review.
No concerns!
This is the test harness I intend to use to validate the fix for #656 and #649.
Rationale:
Parquet pruning is broken (#656, see also #649), yet all our tests pass. This is not good ...
The test and its infrastructure are fairly large, so I wanted to get them reviewed separately, prior to the actual bug fixes.
Changes
- Expose SQLMetrics after a plan has been executed

I plan to extend the parquet test to cover more cases as I work on fixing the bugs.
Are there any user-facing changes?
There are some new user-facing SQL metrics.