Conversation

@alamb (Contributor) commented Jul 3, 2021

This is the test harness I intend to use to validate the fix for #656 and #649.

Rationale:

Parquet pruning is broken (#656; see also #649), yet all our tests are passing. This is not good ...

The test and its infrastructure are fairly large on their own, so I wanted to get them reviewed separately, prior to the actual bug fixes.

Changes

  1. Add an end-to-end test for parquet pruning
  2. Add statistics to the parquet reader (these are used in the tests)
  3. Add a "plan_statistics" function to gather all SQLMetrics after a plan has been executed (see the sketch below)

I plan to extend the parquet test to cover more cases as I work on fixing the bugs

Are there any user-facing changes?

There are some new user-facing SQL metrics.
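
For context, a minimal, self-contained sketch of what a metrics-gathering helper in the spirit of plan_statistics might look like. The PlanNode type and the usize-valued metrics below are simplified stand-ins for DataFusion's ExecutionPlan and SQLMetric, not the real API, and the metric name is invented:

```rust
use std::collections::HashMap;

/// Simplified stand-in for an executed operator in a plan tree.
struct PlanNode {
    /// Metrics recorded by this operator, e.g. "numRowGroupsPruned" -> 3.
    metrics: HashMap<String, usize>,
    children: Vec<PlanNode>,
}

/// Gather the metrics of every operator in the plan, summing values
/// that share a name across operators.
fn plan_statistics(plan: &PlanNode) -> HashMap<String, usize> {
    let mut combined = HashMap::new();
    collect(plan, &mut combined);
    combined
}

fn collect(node: &PlanNode, out: &mut HashMap<String, usize>) {
    for (name, value) in &node.metrics {
        *out.entry(name.clone()).or_insert(0) += *value;
    }
    for child in &node.children {
        collect(child, out);
    }
}
```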

use async_trait::async_trait;
use futures::stream::{Stream, StreamExt};

use super::SQLMetric;
@alamb (Contributor Author) commented on the diff:

The changes to this file are to add metrics on the pruning (that are then used in the test)
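
For illustration, a self-contained sketch of the shape of such a metric. CounterMetric below is a simplified stand-in for DataFusion's SQLMetric, showing only the essentials: a named, thread-safe counter shared via Arc and bumped as row groups are pruned:

```rust
use std::sync::{
    atomic::{AtomicUsize, Ordering},
    Arc,
};

/// Simplified stand-in for SQLMetric: a thread-safe counter.
#[derive(Debug, Default)]
struct CounterMetric {
    value: AtomicUsize,
}

impl CounterMetric {
    fn add(&self, n: usize) {
        self.value.fetch_add(n, Ordering::Relaxed);
    }
    fn value(&self) -> usize {
        self.value.load(Ordering::Relaxed)
    }
}

fn main() {
    // One handle lives in the scan operator, another with whoever
    // inspects the metrics after the query runs.
    let row_groups_pruned = Arc::new(CounterMetric::default());
    row_groups_pruned.add(3); // e.g. three row groups skipped via statistics
    assert_eq!(row_groups_pruned.value(), 3);
}
```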

Field::new("millis", arr_millis.data_type().clone(), false),
Field::new("secs", arr_secs.data_type().clone(), false),
Field::new("name", arr_names.data_type().clone(), false),
Field::new("nanos", arr_nanos.data_type().clone(), true),
@alamb (Contributor Author) commented on the diff:

This was a bug I found while using similar code for the end-to-end test: the actual data contains None values, so the schema field needs to be marked nullable.
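
A minimal sketch of the failure mode, assuming the arrow crate: constructing a RecordBatch from an array that contains None only succeeds when the corresponding field is declared nullable (the third argument to Field::new):

```rust
use std::sync::Arc;

use arrow::array::TimestampNanosecondArray;
use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
use arrow::record_batch::RecordBatch;

fn main() {
    // The data contains a None, mirroring the test data in this PR.
    let arr_nanos = TimestampNanosecondArray::from(vec![Some(1), None]);

    // Declaring the field nullable (third argument = true) is the fix;
    // with `false` here, RecordBatch::try_new returns an error.
    let schema = Schema::new(vec![Field::new(
        "nanos",
        DataType::Timestamp(TimeUnit::Nanosecond, None),
        true,
    )]);

    let batch = RecordBatch::try_new(Arc::new(schema), vec![Arc::new(arr_nanos)]);
    assert!(batch.is_ok());
}
```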

use parquet::{arrow::ArrowWriter, file::properties::WriterProperties};
use tempfile::NamedTempFile;

#[tokio::test]
@alamb (Contributor Author) commented on the diff:

Here are the new end-to-end tests. They create an actual parquet file and then run a query against it, validating the pruning metrics. They currently "pass", but they show that no actual pruning is occurring.
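
A sketch of the file-creation half that these imports support, assuming the arrow, parquet, and tempfile crates. The schema and values are invented for illustration, and the query plus metric assertions are elided:

```rust
use std::sync::Arc;

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::{arrow::ArrowWriter, file::properties::WriterProperties};
use tempfile::NamedTempFile;

/// Write a small RecordBatch to a real parquet file in a temp location;
/// the test can then query it and assert on the pruning metrics.
fn make_test_file() -> NamedTempFile {
    let schema = Arc::new(Schema::new(vec![Field::new(
        "secs",
        DataType::Int64,
        false,
    )]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int64Array::from(vec![1, 2, 3]))],
    )
    .unwrap();

    let file = NamedTempFile::new().unwrap();
    // Small row groups keep the row-group pruning behavior predictable.
    let props = WriterProperties::builder()
        .set_max_row_group_size(3)
        .build();
    let mut writer =
        ArrowWriter::try_new(file.reopen().unwrap(), schema, Some(props)).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();
    file
}
```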

@alamb (Contributor Author) commented Jul 3, 2021

FYI @houqp and @yordan-pavlov

/// Statistics for the data set (sum of statistics for all partitions)
statistics: Statistics,
/// Number of times the pruning predicate could not be created
predicate_creation_errors: Arc<SQLMetric>,
A reviewer (Contributor) commented on the diff:


would it make sense to move predicate_creation_errors into ParquetMetrics, next to the other two metrics?

@alamb (Contributor Author) replied:


The reason I didn't do this is that the other metrics are per ParquetPartition (i.e. per file / set of files), while this one is a single metric for the overall ParquetExec.

I agree this is confusing -- I'll make two statistics structures (ParquetPartitionMetrics and ParquetMetrics), which hopefully will make this clearer.
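
A sketch of that split. The struct names follow the ones mentioned in this thread, but the exact fields are illustrative, and plain atomic counters stand in for SQLMetric:

```rust
use std::sync::{atomic::AtomicUsize, Arc};

/// One instance per ParquetPartition (i.e. per file / set of files).
struct ParquetPartitionMetrics {
    /// Row groups skipped thanks to the pruning predicate.
    row_groups_pruned: Arc<AtomicUsize>,
}

/// One instance per ParquetExec, covering the scan as a whole.
struct ParquetMetrics {
    /// Number of times the pruning predicate could not be created.
    predicate_creation_errors: Arc<AtomicUsize>,
}
```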

@alamb (Contributor Author) replied:


Done in 595fcdd07b.

match predicate_values {
Ok(values) => Box::new(move |_, i| {
// NB: false means don't scan row group
if !values[i] {
@yordan-pavlov (Contributor) commented Jul 3, 2021:


Should the counting of filtered row groups happen in the actual filter function? What is the benefit of doing the counting there? Why not move it outside, just before the filter function is returned (similar to the error case below)?

@alamb (Contributor Author) replied:


One difference that results from updating the metrics right before use is what happens in a limit query, e.g. SELECT * FROM parquet_table WHERE date < 2012-20-20 LIMIT 10 -- in this case the query might never even contemplate reading a particular partition, due to the limit.

However, having written that, I think reporting the stats prior to the actual function would make the statistics easier to understand (and more consistent from query to query). I will make that change.
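
A sketch of that change, with a simplified signature: the pruned row groups are counted once, just before the filter closure is returned, instead of on each call inside it:

```rust
use std::sync::{
    atomic::{AtomicUsize, Ordering},
    Arc,
};

/// `values[i] == false` means "don't scan row group i".
fn build_row_group_filter(
    values: Vec<bool>,
    row_groups_pruned: Arc<AtomicUsize>,
) -> Box<dyn Fn(usize) -> bool> {
    // Record the stats up front, so they stay consistent from query to
    // query even when a LIMIT stops the scan before every row group is
    // consulted.
    let pruned = values.iter().filter(|keep| !**keep).count();
    row_groups_pruned.fetch_add(pruned, Ordering::Relaxed);

    Box::new(move |i| values[i])
}
```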

@yordan-pavlov (Contributor) commented:

Looks very interesting @alamb. I wonder how else these metrics could be used; they could be useful for general diagnostics and for troubleshooting the performance of queries.

@alamb (Contributor Author) commented Jul 5, 2021

> Looks very interesting @alamb. I wonder how else these metrics could be used; they could be useful for general diagnostics and for troubleshooting the performance of queries.

Indeed @yordan-pavlov - in fact I think @andygrove is doing just that with PRs such as #662 👍

@alamb force-pushed the alamb/parquet_pruning_end_to_end branch from 159dde8 to 37d97f7 on July 5, 2021 at 11:37
@alamb (Contributor Author) commented Jul 6, 2021

@Dandandan / @yordan-pavlov / @houqp, any concerns about merging this one in? I now have the fix PRs backed up behind this test PR, so I would like to get the test in so I can get the fixes up for review.

@Dandandan (Contributor) replied:

No concerns!

@alamb merged commit 8cbb750 into apache:master on Jul 6, 2021
@alamb deleted the alamb/parquet_pruning_end_to_end branch on July 6, 2021 at 16:41
@houqp added the enhancement (New feature or request) label on Jul 31, 2021