ARROW-11014: [Rust] [DataFusion] Use correct statistics for ParquetExec #8992
Conversation
@seddonm1 fyi
Codecov Report

    @@            Coverage Diff             @@
    ##           master    #8992      +/-   ##
    ==========================================
    + Coverage   82.64%   82.66%   +0.01%
    ==========================================
      Files         200      200
      Lines       49730    49798      +68
    ==========================================
    + Hits        41098    41164      +66
    - Misses       8632     8634       +2

Continue to review full report at Codecov.
Thanks. Good find. 🤦 by me
I wonder if there is some way to test this code -- I also remember being confused about the fact that ParquetExec
can actually take a directory full of files rather than a single one.
    use async_trait::async_trait;
    use futures::stream::Stream;

    -/// Execution plan for scanning a Parquet file
    +/// Execution plan for scanning one or more Parquet files
👍
    let mut total_byte_size = 0;
    for file in &filenames {
        let file = File::open(file)?;
        let file_reader = Arc::new(SerializedFileReader::new(file)?);
It probably doesn't matter, but we are creating arrow_readers several times for the same file -- here we create them just to read metadata, then right below we (re)open the first one again to read the schema, and then we open them again to actually read data...
This is a good point. I've pushed another change here to collect unique schemas during the scan of the files to avoid the separate read. This now also implements a check to make sure the schemas are the same. I have wasted time in the past tracking down issues due to incompatible files. I added a reference to the issue for implementing schema merging, which would be a nice addition.
I've gone a little further and introduced a ParquetPartition struct to make it more explicit how partitioning works, and added references to related issues for changing the partitioning strategy. I also improved an error message and added more documentation.
Yes, tests are definitely lacking here. I will take this on as a follow-up task for the release: https://issues.apache.org/jira/browse/ARROW-11020
@alamb @jorgecarleitao I got a bit carried away with some other improvements in this PR but I am going to stop now. I filed a follow-up issue to add more comprehensive tests before we release 3.0.0
Also @Dandandan this starts to introduce some per-partition stats now
Makes sense. Maybe we can reuse the summing stats for all partitions for different source types if it gets more complex.
    // build a list of Parquet partitions with statistics and gather all unique schemas
    // used in this data set
    let mut schemas: Vec<Schema> = vec![];
    let mut partitions = vec![];
Could use with_capacity
    }
    let statistics = Statistics {
        num_rows: if num_rows == 0 {
            None
Why map rows and byte size to None here and not Some(0)?
I think if the result is 0, it is best to know that there are 0 records/bytes.
Yes, that was a bit sloppy. This is now fixed.
    pub fn new(
        filenames: Vec<String>,
        partitions: Vec<ParquetPartition>,
I thought that part of the reason for allowing a list of files was to support similar behavior to DeltaLake where an external file contains a list of filenames representing a version of data rather than grouping them by directory structure?
Interesting. I was not aware of this use case. Perhaps we need a specific constructor for that use case. I'll take a look.
It may be premature to support that use case anyway until more of the core engine works.
I pushed a change so there are now try_from_path and try_from_files constructors
Looks good 👍
One comment that I think should be resolved. The rest looks good; great that this is fixed.
Nice
    num_rows: Some(num_rows as usize),
    total_byte_size: Some(total_byte_size as usize),
    };
    partitions.push(ParquetPartition {
👍
ParquetExec represents multiple files but we were calculating statistics based on the first file. I stumbled across this when working on https://issues.apache.org/jira/browse/ARROW-10995

Closes apache#8992 from andygrove/ARROW-11014

Authored-by: Andy Grove <andygrove73@gmail.com>
Signed-off-by: Andy Grove <andygrove73@gmail.com>