Conversation

@alamb (Contributor) commented Jul 3, 2021

This is the test harness I intend to use to validate the fix for #656 and #649.

Rationale:

Parquet pruning is broken (#656; see also #649), yet all our tests are passing. This is not good ...

The test and its infrastructure are fairly large on their own, so I wanted to get them reviewed separately, prior to the actual bug fixes.

Changes

  1. Add an end-to-end test for parquet pruning
  2. Add statistics to the parquet reader (these are used in the tests)
  3. Add a "plan_statistics" function to gather all SQLMetrics after a plan has been executed (see the sketch below)

I plan to extend the parquet test to cover more cases as I work on fixing the bugs

Are there any user-facing changes?

There are some new user-facing SQL metrics.
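
For context, a minimal, self-contained sketch of what a metrics-gathering helper in the spirit of plan_statistics might look like. The PlanNode type and the usize-valued metrics below are simplified stand-ins for DataFusion's ExecutionPlan and SQLMetric, not the real API, and the metric name is invented:

```rust
use std::collections::HashMap;

/// Simplified stand-in for an executed operator in a plan tree.
struct PlanNode {
    /// Metrics recorded by this operator, e.g. "numRowGroupsPruned" -> 3.
    metrics: HashMap<String, usize>,
    children: Vec<PlanNode>,
}

/// Gather the metrics of every operator in the plan, summing values
/// that share a name across operators.
fn plan_statistics(plan: &PlanNode) -> HashMap<String, usize> {
    let mut combined = HashMap::new();
    collect(plan, &mut combined);
    combined
}

fn collect(node: &PlanNode, out: &mut HashMap<String, usize>) {
    for (name, value) in &node.metrics {
        *out.entry(name.clone()).or_insert(0) += *value;
    }
    for child in &node.children {
        collect(child, out);
    }
}
```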

use async_trait::async_trait;
use futures::stream::{Stream, StreamExt};

use super::SQLMetric;
@alamb (Contributor Author) commented on the diff:

The changes to this file are to add metrics on the pruning (that are then used in the test)
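
For illustration, a self-contained sketch of the shape of such a metric. CounterMetric below is a simplified stand-in for DataFusion's SQLMetric, showing only the essentials: a named, thread-safe counter shared via Arc and bumped as row groups are pruned:

```rust
use std::sync::{
    atomic::{AtomicUsize, Ordering},
    Arc,
};

/// Simplified stand-in for SQLMetric: a thread-safe counter.
#[derive(Debug, Default)]
struct CounterMetric {
    value: AtomicUsize,
}

impl CounterMetric {
    fn add(&self, n: usize) {
        self.value.fetch_add(n, Ordering::Relaxed);
    }
    fn value(&self) -> usize {
        self.value.load(Ordering::Relaxed)
    }
}

fn main() {
    // One handle lives in the scan operator, another with whoever
    // inspects the metrics after the query runs.
    let row_groups_pruned = Arc::new(CounterMetric::default());
    row_groups_pruned.add(3); // e.g. three row groups skipped via statistics
    assert_eq!(row_groups_pruned.value(), 3);
}
```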

Field::new("millis", arr_millis.data_type().clone(), false),
Field::new("secs", arr_secs.data_type().clone(), false),
Field::new("name", arr_names.data_type().clone(), false),
Field::new("nanos", arr_nanos.data_type().clone(), true),
@alamb (Contributor Author) commented on the diff:

This was a bug I found while using similar code for the end-to-end test: the actual data contains None values, so the schema field needs to be marked nullable.
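
A minimal sketch of the failure mode, assuming the arrow crate: constructing a RecordBatch from an array that contains None only succeeds when the corresponding field is declared nullable (the third argument to Field::new):

```rust
use std::sync::Arc;

use arrow::array::TimestampNanosecondArray;
use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
use arrow::record_batch::RecordBatch;

fn main() {
    // The data contains a None, mirroring the test data in this PR.
    let arr_nanos = TimestampNanosecondArray::from(vec![Some(1), None]);

    // Declaring the field nullable (third argument = true) is the fix;
    // with `false` here, RecordBatch::try_new returns an error.
    let schema = Schema::new(vec![Field::new(
        "nanos",
        DataType::Timestamp(TimeUnit::Nanosecond, None),
        true,
    )]);

    let batch = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(arr_nanos)]);
    assert!(batch.is_ok());
}
```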

use parquet::{arrow::ArrowWriter, file::properties::WriterProperties};
use tempfile::NamedTempFile;

#[tokio::test]
@alamb (Contributor Author) commented on the diff:

Here are the new end-to-end tests. They create an actual parquet file and then run a query against it, validating the pruning metrics. They currently "pass", but they show that no actual pruning is occurring.
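
A sketch of the file-creation half that these imports support, assuming the arrow, parquet, and tempfile crates. The schema and values are invented for illustration, and the query plus metric assertions are elided:

```rust
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::{arrow::ArrowWriter, file::properties::WriterProperties};
use tempfile::NamedTempFile;

/// Write a small RecordBatch to a real parquet file in a temp location;
/// the test can then query it and assert on the pruning metrics.
fn make_test_file() -> NamedTempFile {
    let schema = Arc::new(Schema::new(vec![Field::new(
        "secs",
        DataType::Int64,
        false,
    )]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int64Array::from(vec![1, 2, 3]))],
    )
    .unwrap();

    let file = NamedTempFile::new().unwrap();
    // Small row groups keep the row-group pruning behavior predictable.
    let props = WriterProperties::builder()
        .set_max_row_group_size(3)
        .build();
    let mut writer =
        ArrowWriter::try_new(file.reopen().unwrap(), schema, Some(props)).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();
    file
}
```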

@alamb (Contributor Author) commented Jul 3, 2021

FYI @houqp and @yordan-pavlov

/// Statistics for the data set (sum of statistics for all partitions)
statistics: Statistics,
/// Number of times the pruning predicate could not be created
predicate_creation_errors: Arc<SQLMetric>,
A reviewer (Contributor) commented on the diff:


would it make sense to move predicate_creation_errors into ParquetMetrics, next to the other two metrics?

@alamb (Contributor Author) replied:


The reason I didn't do this is that the other metrics are per ParquetPartition (i.e. per file / set of files), while this one is a single metric for the overall ParquetExec.

I agree this is confusing -- I'll make two statistics structures (ParquetPartitionMetrics and ParquetMetrics), which hopefully will make this clearer.
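
A sketch of that split. The struct names follow the ones mentioned in this thread, but the exact fields are illustrative, and plain atomic counters stand in for SQLMetric:

```rust
use std::sync::{atomic::AtomicUsize, Arc};

/// One instance per ParquetPartition (i.e. per file / set of files).
struct ParquetPartitionMetrics {
    /// Row groups skipped thanks to the pruning predicate.
    row_groups_pruned: Arc<AtomicUsize>,
}

/// One instance per ParquetExec, covering the scan as a whole.
struct ParquetMetrics {
    /// Number of times the pruning predicate could not be created.
    predicate_creation_errors: Arc<AtomicUsize>,
}
```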

@alamb (Contributor Author) replied:


Done in 595fcdd07b.

match predicate_values {
Ok(values) => Box::new(move |_, i| {
// NB: false means don't scan row group
if !values[i] {
@yordan-pavlov (Contributor) commented Jul 3, 2021:


Should the counting of filtered row groups happen in the actual filter function? What is the benefit of doing the counting there? Why not move it outside, just before the filter function is returned (similar to the error case below)?

@alamb (Contributor Author) replied:


One difference that results from updating the metrics right before use is what happens in a limit query, e.g. SELECT * FROM parquet_table WHERE date < 2012-20-20 LIMIT 10 -- in this case the query might never even contemplate reading a particular partition, due to the limit.

However, having written that, I think reporting the stats prior to the actual function would make the statistics easier to understand (and more consistent from query to query). I will make that change.
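
A sketch of that change, with a simplified signature: the pruned row groups are counted once, just before the filter closure is returned, instead of on each call inside it:

```rust
use std::sync::{
    atomic::{AtomicUsize, Ordering},
    Arc,
};

/// `values[i] == false` means "don't scan row group i".
fn build_row_group_filter(
    values: Vec<bool>,
    row_groups_pruned: Arc<AtomicUsize>,
) -> Box<dyn Fn(usize) -> bool> {
    // Record the stats up front, so they stay consistent from query to
    // query even when a LIMIT stops the scan before every row group is
    // consulted.
    let pruned = values.iter().filter(|keep| !**keep).count();
    row_groups_pruned.fetch_add(pruned, Ordering::Relaxed);

    Box::new(move |i| values[i])
}
```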

@yordan-pavlov (Contributor) commented:

Looks very interesting @alamb. I wonder how else these metrics could be used; they could be useful for general diagnostics and for troubleshooting the performance of queries.

@alamb (Contributor Author) commented Jul 5, 2021

> Looks very interesting @alamb. I wonder how else these metrics could be used; they could be useful for general diagnostics and for troubleshooting the performance of queries.

Indeed @yordan-pavlov - in fact I think @andygrove is doing just that with PRs such as #662 👍

@alamb force-pushed the alamb/parquet_pruning_end_to_end branch from 159dde8 to 37d97f7 on July 5, 2021 at 11:37
@alamb (Contributor Author) commented Jul 6, 2021

@Dandandan / @yordan-pavlov / @houqp, any concerns about merging this one in? I now have the fix PRs backed up behind this test PR, so I would like to get the test in so I can get the fixes up for review.

@Dandandan (Contributor) replied:

No concerns!

@alamb merged commit 8cbb750 into apache:master on Jul 6, 2021
@alamb deleted the alamb/parquet_pruning_end_to_end branch on July 6, 2021 at 16:41
@houqp added the enhancement (New feature or request) label on Jul 31, 2021