Fix Parquet Reader's Arrow Schema Inference #1682

tustvold · 2022-05-09T13:11:30Z

Which issue does this PR close?

Closes #1655
Closes #1663
Closes #1652
Closes #1654
Closes #1681
Closes #1680
Closes #1484

Rationale for this change

See tickets, in particular #1655

What changes are included in this PR?

This separates the schema inference logic from the logic that reads the parquet file, which makes the logic clearer, easier to test, and hopefully less buggy.

Are there any user-facing changes?

Yes, schema inference may change. It will be more correct, but this is still a change.

We also explicitly no longer support out-of-order column projection, whereas previously it would be silently ignored in some code paths.

alamb · 2022-05-10T20:00:00Z

@tustvold please let me know when you would like any substantial review for this

Don't treat embedded arrow schema as authoritative (apache#1663)

Fix projection of nested parquet files (apache#1652) (apache#1654)

Fix schema inference for repeated fields (apache#1681)

Support reading alternative list representations from parquet (apache#1680)

tustvold · 2022-05-11T07:18:25Z

OK, I've backed out the changes related to #1666 from this PR, so this should preserve the existing schema inference behaviour, and I think it is ready for review.

I'm confident that this PR lays the groundwork to make #1666 relatively straightforward.

tustvold · 2022-05-11T07:19:45Z

parquet/src/arrow/schema.rs

-    }
-}
-
-impl ParquetTypeConverter<'_> {


This logic is copied largely wholesale into schema/primitive.rs

tustvold · 2022-05-11T07:20:47Z

parquet/src/arrow/schema.rs

@@ -1261,7 +746,7 @@ mod tests {
        {
            arrow_fields.push(Field::new(
                "my_list",
-                DataType::List(Box::new(Field::new("element", DataType::Utf8, true))),
+                DataType::List(Box::new(Field::new("str", DataType::Utf8, false))),


Here we can see this fixing #1681

I think the comments need to be updated for the changes in code

//
// List<String> (list nullable, elements non-null)
// optional group my_list (LIST) {
//   repeated group element {
//     required binary str (UTF8);
//   };
// }

That comment is still correct, this is a nullable list with non-nullable elements, as described by that parquet schema.

The test was previously wrong

tustvold · 2022-05-11T07:21:48Z

parquet/src/arrow/schema.rs

@@ -1679,7 +1168,7 @@ mod tests {

        let parquet_schema = SchemaDescriptor::new(Arc::new(parquet_group_type));
        let converted_arrow_schema =
-            parquet_to_arrow_schema_by_columns(&parquet_schema, vec![3, 4, 0], None)
+            parquet_to_arrow_schema_by_columns(&parquet_schema, vec![0, 3, 4], None)


Out-of-order column projection previously misbehaved: parquet_to_arrow_schema_by_columns supported it, but the actual reader logic did not. This change makes it consistently unsupported, so it now returns an error, because it is hard to reason about what the correct semantics should be for nested schemas.

I didn't see a test for the (new) error case -- I suggest adding one so we don't get accidental regressions.
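
To illustrate the stricter behaviour described above, here is a minimal sketch of an ordering check on a list of leaf column indices. `validate_projection` is a hypothetical helper written for this note, not the function this PR adds; only the error text is taken from the downstream reports of this change.

```rust
/// Hypothetical helper (not part of the parquet crate): accept a projection
/// only if its leaf column indices never decrease.
/// Duplicate indices are left alone here; how the real reader treats them
/// is not covered in this thread.
fn validate_projection(indices: &[usize]) -> Result<(), String> {
    // Compare each index with its successor.
    for pair in indices.windows(2) {
        if pair[0] > pair[1] {
            return Err(format!(
                "out of order projection is not supported: {} appears before {}",
                pair[0], pair[1]
            ));
        }
    }
    Ok(())
}
```

With this check, the old test input `vec![3, 4, 0]` is rejected and the updated `vec![0, 3, 4]` is accepted. A regression test for the error case could assert on the returned message in the same way.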

tustvold · 2022-05-11T07:24:31Z

parquet/src/arrow/schema/complex.rs

+    column_mask: Vec<bool>,
+}
+
+impl Visitor {


This is logic extracted from builder.rs. It wasn't possible to reuse the existing TypeVisitor because its handling of lists interfered with #1680.

I really like this structure to encapsulate the logic for using the embedded schema. 👍

tustvold · 2022-05-11T07:24:48Z

parquet/src/arrow/schema/complex.rs

+        };
+
+        Ok(Some(match repetition {
+            Repetition::REPEATED => primitive_field.into_list(primitive_type.name()),


Here is the logic that now supports #1680. There is comprehensive test coverage of this in schema.rs, in particular test_arrow_schema_roundtrip.

I'm actually quite pleased with this: despite the underlying list representation in parquet being fundamentally different, the ArrayBuilder can be completely oblivious to this fact 😄
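
As a rough illustration of the `Repetition::REPEATED` arm in the hunk above, the sketch below turns a bare repeated primitive into a required list of required elements, with the element named after the primitive. All types here (`Repetition`, `ArrowType`, `ArrowField`, `convert_primitive`) are simplified stand-ins invented for this example and are not the parquet or arrow crate types.

```rust
/// Simplified stand-in for parquet's repetition (assumption, not the real enum).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum Repetition {
    Required,
    Optional,
    Repeated,
}

/// Toy arrow-like data type; only what the example needs.
#[derive(Debug, Clone, PartialEq)]
enum ArrowType {
    Utf8,
    List(Box<ArrowField>),
}

/// Toy arrow-like field.
#[derive(Debug, Clone, PartialEq)]
struct ArrowField {
    name: String,
    data_type: ArrowType,
    nullable: bool,
}

/// Map a primitive parquet column to a field based on its repetition.
fn convert_primitive(name: &str, repetition: Repetition, data_type: ArrowType) -> ArrowField {
    let field = |nullable| ArrowField {
        name: name.to_string(),
        data_type: data_type.clone(),
        nullable,
    };
    match repetition {
        Repetition::Required => field(false),
        Repetition::Optional => field(true),
        // A bare repeated primitive is read as a non-null list whose
        // elements are themselves non-null and share the primitive's name.
        Repetition::Repeated => ArrowField {
            name: name.to_string(),
            data_type: ArrowType::List(Box::new(field(false))),
            nullable: false,
        },
    }
}
```

The point of the design is that the list wrapping happens during schema conversion, so code that later builds arrays only ever sees an ordinary list field.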

tustvold · 2022-05-11T07:27:21Z

parquet/src/arrow/schema/primitive.rs

+
+/// Uses an type hint from the embedded arrow schema to aid in faithfully
+/// reproducing the data as it was written into parquet
+fn apply_hint(parquet: DataType, hint: DataType) -> DataType {


This is the change that fixes #1663 - we only use the arrow schema to hint types

I like this centralization of hinting logic -- it makes it easy to understand where arrow and parquet type systems aren't compatible
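
A rough sketch of the "hint, not authority" idea, assuming a toy `SimpleType` enum in place of arrow's `DataType`. The function name `apply_hint_sketch` and the two accepted pairs are illustrative assumptions, not the actual table in schema/primitive.rs.

```rust
/// Toy type enum standing in for arrow's DataType (assumption).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum SimpleType {
    Int32,
    Date32,
    Date64,
    Utf8,
    LargeUtf8,
}

/// Return the hinted type only when it is a known-compatible refinement of
/// the type inferred from the parquet schema; otherwise keep the inferred type.
fn apply_hint_sketch(parquet: SimpleType, hint: SimpleType) -> SimpleType {
    use SimpleType::*;
    match (parquet, hint) {
        // Illustrative compatible pairs: same values, different arrow layout.
        (Utf8, LargeUtf8) => hint,
        (Date32, Date64) => hint,
        // Anything else: the parquet schema wins and the hint is ignored.
        _ => parquet,
    }
}
```

Because the fallback arm always returns the parquet-derived type, a stale or mismatched embedded schema cannot change the physical interpretation of a column in this sketch.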

tustvold · 2022-05-11T07:29:10Z

parquet/src/arrow/schema/complex.rs

+    rep_level: i16,
+    def_level: i16,
+    /// An optional [`DataType`] sourced from the embedded arrow schema
+    data_type: Option<DataType>,


This is what fixes #1654 - we carry the DataType as we walk the tree, which prevents it from misbehaving
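
The sketch below shows one way a context holding `rep_level`, `def_level`, and an optional hint can be handed down while walking a schema tree. It is a simplified illustration, not the visitor from complex.rs: `Node`, `Hint`, `Context`, `Leaf`, and `visit` are hypothetical, type hints are plain strings, and the hint tree is assumed to mirror the parquet tree child for child. The level rules are the standard ones: an optional node adds one definition level, and a repeated node adds one definition level and one repetition level.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Repetition {
    Required,
    Optional,
    Repeated,
}

/// Toy parquet schema node.
struct Node {
    name: &'static str,
    repetition: Repetition,
    children: Vec<Node>,
}

/// Toy hint tree, assumed to mirror the schema tree one-to-one.
struct Hint {
    data_type: &'static str,
    children: Vec<Hint>,
}

/// State carried down the tree; copied and adjusted at each level.
#[derive(Clone, Copy)]
struct Context<'a> {
    rep_level: i16,
    def_level: i16,
    hint: Option<&'a Hint>,
}

#[derive(Debug, PartialEq)]
struct Leaf {
    name: String,
    rep_level: i16,
    def_level: i16,
    hint: Option<&'static str>,
}

fn visit(node: &Node, parent: Context<'_>, out: &mut Vec<Leaf>) {
    let mut ctx = parent;
    match node.repetition {
        Repetition::Required => {}
        Repetition::Optional => ctx.def_level += 1,
        Repetition::Repeated => {
            ctx.def_level += 1;
            ctx.rep_level += 1;
        }
    }
    if node.children.is_empty() {
        out.push(Leaf {
            name: node.name.to_string(),
            rep_level: ctx.rep_level,
            def_level: ctx.def_level,
            hint: ctx.hint.map(|h| h.data_type),
        });
        return;
    }
    for (i, child) in node.children.iter().enumerate() {
        // Each child receives the hint for its own position, if there is one.
        let child_ctx = Context {
            hint: ctx.hint.and_then(|h| h.children.get(i)),
            ..ctx
        };
        visit(child, child_ctx, out);
    }
}
```

Since the hint travels with the node it belongs to, a leaf can never pick up the hint of a sibling or of a column that was projected away, which is the kind of mismatch this sketch is meant to rule out.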

tustvold · 2022-05-11T07:29:37Z

parquet/src/arrow/schema/complex.rs

+}
+
+/// Representation of a parquet file, in terms of arrow schema elements
+pub struct ParquetField {


This is the new structure as described in #1655

tustvold · 2022-05-11T07:43:07Z

parquet/src/arrow/arrow_reader.rs

@@ -1050,6 +1050,41 @@ mod tests {
        for batch in record_batch_reader {
            batch.unwrap();
        }
+
+        let projected_reader = arrow_reader


This is a test for #1654 and #1652

Suggested change

-        let projected_reader = arrow_reader
+        // Test for https://github.com/apache/arrow-rs/issues/1654 and
+        // https://github.com/apache/arrow-rs/issues/1652
+        let projected_reader = arrow_reader

codecov-commenter · 2022-05-11T07:43:37Z

Codecov Report

Merging #1682 (e2f12de) into master (e02869a) will increase coverage by 0.10%.
The diff coverage is 80.62%.

❗ Current head e2f12de differs from pull request most recent head 5fd8cd8. Consider uploading reports for the commit 5fd8cd8 to get more accurate results

@@            Coverage Diff             @@
##           master    #1682      +/-   ##
==========================================
+ Coverage   83.15%   83.25%   +0.10%     
==========================================
  Files         193      195       +2     
  Lines       56007    56049      +42     
==========================================
+ Hits        46572    46665      +93     
+ Misses       9435     9384      -51

Impacted Files	Coverage Δ
parquet/src/errors.rs	`29.62% <ø> (ø)`
parquet/src/arrow/schema/complex.rs	`73.81% <73.81%> (ø)`
parquet/src/arrow/schema/primitive.rs	`76.99% <76.99%> (ø)`
parquet/src/arrow/array_reader/builder.rs	`93.50% <91.93%> (+24.53%)`	⬆️
parquet/src/arrow/schema.rs	`96.76% <92.50%> (+10.98%)`	⬆️
parquet/src/arrow/array_reader/list_array.rs	`93.35% <100.00%> (+0.07%)`	⬆️
parquet/src/arrow/arrow_writer.rs	`97.66% <100.00%> (ø)`
parquet/src/schema/types.rs	`83.83% <0.00%> (-1.85%)`	⬇️
arrow/src/datatypes/datatype.rs	`65.09% <0.00%> (-1.71%)`	⬇️
parquet/src/schema/visitor.rs	`66.66% <0.00%> (-1.34%)`	⬇️
... and 11 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e02869a...5fd8cd8.

alamb · 2022-05-11T12:53:54Z

I will review this later today

alamb · 2022-05-11T20:50:43Z

Sorry, I don't think I will get to this today -- will do first thing tomorrow. Sorry @tustvold 😞 I just need to find enough contiguous time to do the review and that is hard to come by.

alamb

Thank you for taking on this challenge -- @tustvold 🏅 🏆

I won't say I totally grok all the changes in this PR, but I did read it carefully and it makes sense to me and seems to allow for easier improvements going forward. I like how the type logic is now encapsulated more.

Not sure if anyone who uses structured types in parquet (@bjchambers ? @TimDiekmann @jhorstmann ?) might be interested in testing their code with this PR.

I am not sure if anyone else wants a chance to review or if we should merge and include in arrow 14.0.0 (which I am starting to prepare for).

cc @sunchao @nevi-me @viirya

alamb · 2022-05-12T12:46:55Z

parquet/src/arrow/arrow_reader.rs

@@ -1050,6 +1050,41 @@ mod tests {
        for batch in record_batch_reader {
            batch.unwrap();
        }
+
+        let projected_reader = arrow_reader


Suggested change

-        let projected_reader = arrow_reader
+        // Test for https://github.com/apache/arrow-rs/issues/1654 and
+        // https://github.com/apache/arrow-rs/issues/1652
+        let projected_reader = arrow_reader

alamb · 2022-05-12T12:52:51Z

parquet/src/arrow/schema.rs

        DataType::Date32 => Type::primitive_type_builder(name, PhysicalType::INT32)
            .with_logical_type(Some(LogicalType::Date))
            .with_repetition(repetition)
            .build(),
-        // date64 is cast to date32
+        // date64 is cast to date32 (#1666)


alamb · 2022-05-12T12:54:43Z

parquet/src/arrow/schema.rs

        DataType::Date64 => Type::primitive_type_builder(name, PhysicalType::INT32)
            .with_logical_type(Some(LogicalType::Date))
            .with_repetition(repetition)
            .build(),
-        DataType::Time32(_) => Type::primitive_type_builder(name, PhysicalType::INT32)
+        DataType::Time32(TimeUnit::Second) => {
+            // Cannot represent seconds in LogicalType


Confirming this is a (seemingly better) change in behavior, right? Now no logical type is stored for arrow Time32(seconds), whereas previously the logical type Time(millis) was stored.

Yes. To be clear, storing Time(millis) wouldn't be wrong if the writer coerced the types to match. The problem is that it does not.
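
A small sketch of the mapping discussed here, using toy enums rather than the arrow and parquet crate types (`TimeUnit`, `ParquetTimeUnit`, and `time32_logical_unit` are stand-ins written for this note). Parquet's TIME logical type has millisecond, microsecond, and nanosecond units but no second unit, so a seconds column gets no annotation.

```rust
/// Units an arrow Time32 column can carry (toy stand-in).
#[derive(Debug, Clone, Copy, PartialEq)]
enum TimeUnit {
    Second,
    Millisecond,
}

/// The only parquet TIME unit relevant to a 32-bit column (toy stand-in).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ParquetTimeUnit {
    Millis,
}

/// Pick the parquet TIME unit to annotate an INT32 column with, if any.
fn time32_logical_unit(unit: TimeUnit) -> Option<ParquetTimeUnit> {
    match unit {
        // Seconds cannot be expressed, so store a plain INT32 with no
        // logical type rather than mislabel the values as milliseconds.
        TimeUnit::Second => None,
        TimeUnit::Millisecond => Some(ParquetTimeUnit::Millis),
    }
}
```

Returning `None` for seconds keeps the stored integers and their annotation consistent, which is the concern raised in the reply above: labelling the column as millis would only be correct if the writer also multiplied the values.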

alamb · 2022-05-12T12:56:56Z

parquet/src/arrow/schema.rs

@@ -1261,7 +746,7 @@ mod tests {
        {
            arrow_fields.push(Field::new(
                "my_list",
-                DataType::List(Box::new(Field::new("element", DataType::Utf8, true))),
+                DataType::List(Box::new(Field::new("str", DataType::Utf8, false))),


I think the comments need to be updated for the changes in code

//
// List<String> (list nullable, elements non-null)
// optional group my_list (LIST) {
//   repeated group element {
//     required binary str (UTF8);
//   };
// }

alamb · 2022-05-12T13:11:25Z

parquet/src/arrow/schema/complex.rs

+    column_mask: Vec<bool>,
+}
+
+impl Visitor {


I really like this structure to encapsulate the logic for using the embedded schema. 👍

alamb · 2022-05-12T13:12:57Z

parquet/src/arrow/schema/complex.rs

+    };
+
+    Ok(visitor.dispatch(parquet_type, context)?.unwrap())
+}


It would be awesome to add tests specifically for this logic that enumerate parquet types and their expected conversions to arrow.

Maybe that could be done as a follow-on PR.

There is fairly good coverage of the type conversion already in schema.rs, but there is definitely scope for testing repetition levels in addition. Filed #1698

alamb · 2022-05-12T13:14:03Z

parquet/src/arrow/schema/primitive.rs

+
+/// Uses an type hint from the embedded arrow schema to aid in faithfully
+/// reproducing the data as it was written into parquet
+fn apply_hint(parquet: DataType, hint: DataType) -> DataType {


I like this centralization of hinting logic -- it makes it easy to understand where arrow and parquet type systems aren't compatible

alamb · 2022-05-12T13:14:51Z

parquet/src/arrow/schema/primitive.rs

+        }
+        _ => Ok(DataType::FixedSizeBinary(type_length)),
+    }
+}


Also here, some explicit tests showing conversions as a way to document expected behavior would be really nice.

These type conversions are very well covered by the unit tests in schema.rs

tustvold · 2022-05-13T10:04:36Z

I think this is now ready, if I've missed anything let me know. I think it is worth highlighting as a breaking change in the changelog, so that on the off chance it does break something someone was relying on, even if it likely was a bug, they know where to look and we can hopefully quickly unblock them.

https://xkcd.com/1172/

tustvold · 2022-05-13T10:50:02Z

Looking into test failures

jhorstmann · 2022-05-13T12:07:14Z

I didn't review this in detail, but did run our test suite against this branch and did not notice any issues.

alamb · 2022-05-13T17:33:18Z

parquet/src/arrow/schema.rs

+                .unwrap_err()
+                .to_string();
+
+        assert!(


alamb · 2022-05-13T17:35:52Z

Merging (this is the first PR in what will be released as arrow 15.0.0) 🎉

alamb · 2022-05-13T18:01:41Z

😅 glad I didn't try to include this in 14.0.0 -- see #1701

tustvold · 2022-05-13T18:03:46Z

@alamb that's unfortunately expected... DataFusion has a bug... Will provide context on ticket

github-actions bot added the arrow (Changes to the arrow crate) and parquet (Changes to the parquet crate) labels May 9, 2022

tustvold force-pushed the arrow-schema-conversion branch from 6eb932c to e2f12de May 11, 2022 07:15

tustvold marked this pull request as ready for review May 11, 2022 07:18

tustvold added the api-change (Changes to the arrow API) label May 11, 2022

Add more tests

5fd8cd8

tustvold commented May 11, 2022

tustvold added 3 commits May 11, 2022 08:55

Pass pointers by reference

a824700

More docs

f2657e1

Fix lint

fff35d8

alamb mentioned this pull request May 12, 2022

Release next version of arrow-rs after 13.0.0 (14.0.0) #1692

Closed

4 tasks

alamb approved these changes May 12, 2022

tustvold mentioned this pull request May 13, 2022

Improve Unit Test Coverage of Parquet -> Arrow Converter #1698

Closed

tustvold added 2 commits May 13, 2022 10:54

Review feedback

0baa1aa

Review feedback

b58cd74

tustvold mentioned this pull request May 13, 2022

Improve Unit Test Coverage of ArrayReaderBuilder #1484

Closed

Fix test failures related to apache#1697

dd16ec9

tustvold mentioned this pull request May 13, 2022

Fix StructArrayReader handling nested lists (#1651) #1700

Merged

alamb approved these changes May 13, 2022

parquet/src/arrow/schema.rs

.unwrap_err()

.to_string();

assert!(

alamb May 13, 2022

👍

alamb merged commit 5b154ea into apache:master May 13, 2022

alamb added a commit to alamb/datafusion that referenced this pull request May 13, 2022

Update to after apache/arrow-rs#1682

3545e4a

This was referenced May 13, 2022

Error after pre-release arrow upgrade: "out of order projection is not supported" (NOT FOR MERGING) apache/datafusion#2530

Closed

"out of order projection is not supported" after Fix Parquet Arrow Schema Inference #1701

Closed

This was referenced May 13, 2022

Empty array giving error apache/datafusion#2439

Closed

Reading parquet with (pre-release) arrow fails with "out of order projection is not supported" apache/datafusion#2543

Closed

alamb changed the title ~~Fix Parquet Arrow Schema Inference~~ Fix Parquet Reader's Arrow Schema Inference May 26, 2022

tustvold mentioned this pull request Jun 1, 2022

Handle Parquet Files With Inconsistent Timestamp Units #1459

Closed

tustvold mentioned this pull request Jun 23, 2022

unable to write parquet file with UTC timestamp #1932

Closed

tustvold mentioned this pull request Aug 17, 2022

Unsigned Arrays Fail to Roundtrip Through Parquet #2487

Closed

tustvold mentioned this pull request Sep 23, 2022

Fix Backwards Compatible Parquet List Encodings (#1915) #2774

Merged

This was referenced Oct 20, 2022

wrong result when operation parquet apache/datafusion#2044

Closed

Not able to read nano-second timestamp columns in 1.0 parquet files written by pyarrow #455

Closed
