Fix Parquet Reader's Arrow Schema Inference #1682

Merged 8 commits on May 13, 2022

@tustvold tustvold commented May 9, 2022

Which issue does this PR close?

Closes #1655
Closes #1663
Closes #1652
Closes #1654
Closes #1681
Closes #1680
Closes #1484

Rationale for this change

See tickets, in particular #1655

What changes are included in this PR?

This separates the schema inference logic from the logic that reads the parquet file, which makes the logic clearer, easier to test, and hopefully less buggy.

Are there any user-facing changes?

Yes, schema inference may change. It will be more correct, but this is still a change.

We also explicitly no longer support out-of-order column projection, whereas previously it would be silently ignored in some code paths.

@github-actions github-actions bot added arrow Changes to the arrow crate parquet Changes to the parquet crate labels May 9, 2022

alamb commented May 10, 2022

@tustvold please let me know when you would like any substantial review for this

Don't treat embedded arrow schema as authoritative (apache#1663)

Fix projection of nested parquet files (apache#1652) (apache#1654)

Fix schema inference for repeated fields (apache#1681)

Support reading alternative list representations from parquet (apache#1680)

tustvold commented May 11, 2022

OK, I've backed out the changes related to #1666 from this PR, so it should preserve the existing schema inference behaviour, and I think it is ready for review.

I'm confident that this PR will lay the groundwork to make #1666 relatively straightforward.

@tustvold tustvold marked this pull request as ready for review May 11, 2022 07:18
@tustvold tustvold added the api-change Changes to the arrow API label May 11, 2022
}
}

impl ParquetTypeConverter<'_> {
Contributor Author

This logic is copied largely wholesale into schema/primitive.rs

@@ -1261,7 +746,7 @@ mod tests {
{
arrow_fields.push(Field::new(
"my_list",
- DataType::List(Box::new(Field::new("element", DataType::Utf8, true))),
+ DataType::List(Box::new(Field::new("str", DataType::Utf8, false))),
Contributor Author

Here we can see this fixing #1681

Contributor

I think the comments need to be updated for the changes in code

        // // List<String> (list nullable, elements non-null)
        // optional group my_list (LIST) {
        //   repeated group element {
        //     required binary str (UTF8);
        //   };
        // }

Contributor Author

That comment is still correct: this is a nullable list with non-nullable elements, as described by that parquet schema.

The test was previously wrong.

@@ -1679,7 +1168,7 @@ mod tests {

let parquet_schema = SchemaDescriptor::new(Arc::new(parquet_group_type));
let converted_arrow_schema =
- parquet_to_arrow_schema_by_columns(&parquet_schema, vec![3, 4, 0], None)
+ parquet_to_arrow_schema_by_columns(&parquet_schema, vec![0, 3, 4], None)
Contributor Author

Out-of-order column projection previously misbehaved: parquet_to_arrow_schema_by_columns supported it, but the actual reader logic did not. This change makes it consistently unsupported. It now returns an error, because it is hard to reason about what the correct semantics should be for nested schemas.

Contributor

I didn't see a test for the (new) error case -- I suggest adding one so we don't get accidental regressions.
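
For readers following this thread, the kind of check being discussed can be sketched as a validation that the projected leaf indices are strictly increasing. This is a minimal stand-alone illustration, not the code added in this PR: `validate_projection` is a hypothetical helper, and the error type and message used by the parquet crate may differ.

```rust
/// Hypothetical helper, not part of the parquet crate: accepts a projection
/// only if the leaf column indices are strictly increasing.
fn validate_projection(columns: &[usize]) -> Result<(), String> {
    // Compare each adjacent pair of indices; an empty or single-element
    // projection has no pairs and is trivially accepted.
    for pair in columns.windows(2) {
        if pair[0] >= pair[1] {
            return Err(format!(
                "out of order projection is not supported: {} listed before {}",
                pair[0], pair[1]
            ));
        }
    }
    Ok(())
}

fn main() {
    // The ordering used by the updated test is accepted.
    println!("{:?}", validate_projection(&[0, 3, 4]));
    // The ordering used by the old test is rejected.
    println!("{:?}", validate_projection(&[3, 4, 0]));
}
```

Note that a strictly increasing check also rejects duplicate indices; whether the real reader treats duplicates as an error is an assumption of this sketch.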

column_mask: Vec<bool>,
}

impl Visitor {
Contributor Author

This is logic extracted from builder.rs. It wasn't possible to reuse the existing TypeVisitor, as its handling of lists interfered with #1680.

Contributor

I really like this structure to encapsulate the logic for using the embedded schema. 👍

};

Ok(Some(match repetition {
Repetition::REPEATED => primitive_field.into_list(primitive_type.name()),
Contributor Author

@tustvold tustvold May 11, 2022

Here is the logic that now supports #1680. There is comprehensive test coverage of this in schema.rs, in particular test_arrow_schema_roundtrip.

I'm actually quite pleased with this, despite the underlying list representation in parquet being fundamentally different, the ArrayBuilder can be completely oblivious to this fact 😄
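
As an illustration of the mapping discussed here, the sketch below models how a bare repeated primitive could be surfaced as a list. It follows the Parquet backward-compatibility rule that a repeated field outside a LIST-annotated group is read as a required list of required elements. All types and the `primitive_to_field` function are toy stand-ins invented for this example; they are not the arrow or parquet crate types, and the real `into_list` may differ in detail.

```rust
// Toy stand-ins for the arrow and parquet types; all names are hypothetical.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Utf8,
    List(Box<Field>),
}

#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Repetition {
    Required,
    Optional,
    Repeated,
}

/// Maps a primitive parquet column to a toy arrow field.
fn primitive_to_field(name: &str, data_type: DataType, repetition: Repetition) -> Field {
    // Builds the plain (non-list) field with the requested nullability.
    let leaf = |nullable: bool| Field {
        name: name.to_string(),
        data_type: data_type.clone(),
        nullable,
    };
    match repetition {
        Repetition::Required => leaf(false),
        Repetition::Optional => leaf(true),
        // A bare repeated primitive becomes a non-nullable list whose
        // element is non-nullable and carries the primitive's name.
        Repetition::Repeated => Field {
            name: name.to_string(),
            data_type: DataType::List(Box::new(leaf(false))),
            nullable: false,
        },
    }
}

fn main() {
    println!("{:?}", primitive_to_field("values", DataType::Int32, Repetition::Repeated));
    println!("{:?}", primitive_to_field("label", DataType::Utf8, Repetition::Optional));
    println!("{:?}", primitive_to_field("id", DataType::Int32, Repetition::Required));
}
```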


/// Uses a type hint from the embedded arrow schema to aid in faithfully
/// reproducing the data as it was written into parquet
fn apply_hint(parquet: DataType, hint: DataType) -> DataType {
Contributor Author

This is the change that fixes #1663 - we only use the arrow schema to hint types

Contributor

I like this centralization of hinting logic -- it makes it easy to understand where arrow and parquet type systems aren't compatible
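
The idea in this thread (the type derived from the parquet schema wins unless the embedded arrow type is a compatible refinement of it) can be sketched as follows. The enum is a toy stand-in, `apply_hint_sketch` is a hypothetical name, and the two compatible pairs are assumptions chosen for illustration; they are not a list taken from this PR.

```rust
// Toy subset of arrow data types; hypothetical, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int32,
    Int64,
    Utf8,
    LargeUtf8,
    Time32Second,
}

/// Returns `hint` only when it is a compatible refinement of the type
/// inferred from the parquet schema; otherwise keeps the parquet type.
fn apply_hint_sketch(parquet: DataType, hint: DataType) -> DataType {
    match (&parquet, &hint) {
        // Assumed rule: the hint may select the offset size of a string type.
        (DataType::Utf8, DataType::LargeUtf8) => hint,
        // Assumed rule: the hint may restore a time unit that parquet
        // cannot express as a logical type.
        (DataType::Int32, DataType::Time32Second) => hint,
        // Anything else is ignored: the embedded schema is not authoritative.
        _ => parquet,
    }
}

fn main() {
    // A compatible hint is applied.
    println!("{:?}", apply_hint_sketch(DataType::Utf8, DataType::LargeUtf8));
    // An incompatible hint is ignored and the parquet type is kept.
    println!("{:?}", apply_hint_sketch(DataType::Int64, DataType::Utf8));
}
```

Keeping every accepted pair in one `match` is what makes the incompatibilities between the two type systems easy to audit.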

rep_level: i16,
def_level: i16,
/// An optional [`DataType`] sourced from the embedded arrow schema
data_type: Option<DataType>,
Contributor Author

This is what fixes #1654 - we carry the DataType as we walk the tree, which prevents it from misbehaving
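
A rough sketch of what carrying this context down the tree might look like is given below. It uses the standard Parquet level rules: the definition level grows by one for each optional or repeated node, and the repetition level grows by one for each repeated node. `VisitContext`, its `child` method, and the use of a `String` as the type hint are hypothetical simplifications, not the structures introduced in this PR.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Repetition {
    Required,
    Optional,
    Repeated,
}

/// Hypothetical per-node context threaded through the schema walk.
#[derive(Debug, Clone, PartialEq)]
struct VisitContext {
    rep_level: i16,
    def_level: i16,
    /// Optional type hint for this node; a plain string in this sketch.
    data_type: Option<String>,
}

impl VisitContext {
    /// Derives the context for a child node with the given repetition,
    /// pairing it with the hint that belongs to that specific child.
    fn child(&self, repetition: Repetition, hint: Option<String>) -> VisitContext {
        let (rep_level, def_level) = match repetition {
            Repetition::Required => (self.rep_level, self.def_level),
            Repetition::Optional => (self.rep_level, self.def_level + 1),
            Repetition::Repeated => (self.rep_level + 1, self.def_level + 1),
        };
        VisitContext { rep_level, def_level, data_type: hint }
    }
}

fn main() {
    let root = VisitContext { rep_level: 0, def_level: 0, data_type: None };
    // optional group -> repeated group -> required leaf
    let list = root.child(Repetition::Optional, None);
    let item = list.child(Repetition::Repeated, Some("Utf8".to_string()));
    let leaf = item.child(Repetition::Required, None);
    println!("{:?}", leaf);
}
```

Because each child receives its own hint at the point of descent, a projection that skips sibling columns cannot shift hints onto the wrong node, which is the class of bug this comment refers to.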

}

/// Representation of a parquet file, in terms of arrow schema elements
pub struct ParquetField {
Contributor Author

This is the new structure as described in #1655

@@ -1050,6 +1050,41 @@ mod tests {
for batch in record_batch_reader {
batch.unwrap();
}

let projected_reader = arrow_reader
Contributor Author

This is a test for #1654 and #1652

Contributor

Suggested change
- let projected_reader = arrow_reader
+ // Test for https://github.com/apache/arrow-rs/issues/1654 and
+ // https://github.com/apache/arrow-rs/issues/1652
+ let projected_reader = arrow_reader

@codecov-commenter

Codecov Report

Merging #1682 (e2f12de) into master (e02869a) will increase coverage by 0.10%.
The diff coverage is 80.62%.

❗ Current head e2f12de differs from pull request most recent head 5fd8cd8. Consider uploading reports for the commit 5fd8cd8 to get more accurate results

@@            Coverage Diff             @@
##           master    #1682      +/-   ##
==========================================
+ Coverage   83.15%   83.25%   +0.10%     
==========================================
  Files         193      195       +2     
  Lines       56007    56049      +42     
==========================================
+ Hits        46572    46665      +93     
+ Misses       9435     9384      -51     
| Impacted Files | Coverage Δ |
|---|---|
| parquet/src/errors.rs | 29.62% <ø> (ø) |
| parquet/src/arrow/schema/complex.rs | 73.81% <73.81%> (ø) |
| parquet/src/arrow/schema/primitive.rs | 76.99% <76.99%> (ø) |
| parquet/src/arrow/array_reader/builder.rs | 93.50% <91.93%> (+24.53%) ⬆️ |
| parquet/src/arrow/schema.rs | 96.76% <92.50%> (+10.98%) ⬆️ |
| parquet/src/arrow/array_reader/list_array.rs | 93.35% <100.00%> (+0.07%) ⬆️ |
| parquet/src/arrow/arrow_writer.rs | 97.66% <100.00%> (ø) |
| parquet/src/schema/types.rs | 83.83% <0.00%> (-1.85%) ⬇️ |
| parquet/src/arrow/datatypes/datatype.rs | 65.09% <0.00%> (-1.71%) ⬇️ |
| parquet/src/schema/visitor.rs | 66.66% <0.00%> (-1.34%) ⬇️ |

... and 11 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data


alamb commented May 11, 2022

I will review this later today


alamb commented May 11, 2022

Sorry, I don't think I will get to this today -- I will do it first thing tomorrow. Sorry @tustvold 😞 I just need to find enough contiguous time to do the review, and that is hard to come by.

@alamb alamb left a comment

Thank you for taking on this challenge -- @tustvold 🏅 🏆

I won't say I totally grok all the changes in this PR, but I did read it carefully and it makes sense to me and seems to allow for easier improvements going forward. I like how the type logic is now encapsulated more.

Not sure if anyone who uses structured types in parquet (@bjchambers ? @TimDiekmann @jhorstmann ?) might be interested in testing their code with this PR.

I am not sure if anyone else wants a chance to review or if we should merge and include in arrow 14.0.0 (which I am starting to prepare for).

cc @sunchao @nevi-me @viirya

parquet/src/arrow/arrow_writer.rs (outdated, resolved)
DataType::Date32 => Type::primitive_type_builder(name, PhysicalType::INT32)
.with_logical_type(Some(LogicalType::Date))
.with_repetition(repetition)
.build(),
- // date64 is cast to date32
+ // date64 is cast to date32 (#1666)
Contributor

👍 #1666

DataType::Date64 => Type::primitive_type_builder(name, PhysicalType::INT32)
.with_logical_type(Some(LogicalType::Date))
.with_repetition(repetition)
.build(),
- DataType::Time32(_) => Type::primitive_type_builder(name, PhysicalType::INT32)
+ DataType::Time32(TimeUnit::Second) => {
+ // Cannot represent seconds in LogicalType
Contributor

Confirming that this is a (seemingly better) change in behavior, right? Now no logical type is stored for arrow Time32(seconds), whereas previously the logical type Time(millis) was stored.

Contributor Author

Yes. To be clear, the old behaviour wouldn't be wrong if the writer coerced the values to match the stored logical type. The problem is that it does not.
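
The constraint behind this exchange is that the Parquet TIME logical type only defines millisecond, microsecond, and nanosecond units, so a seconds-based arrow time has no direct equivalent. The sketch below illustrates the resulting mapping with toy enums. `ArrowTimeUnit`, `ParquetTimeUnit`, and `time_logical_unit` are hypothetical names invented for this example, not the crate's API.

```rust
// Toy stand-ins; the names are hypothetical and not the crate's own types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ArrowTimeUnit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum ParquetTimeUnit {
    Millis,
    Micros,
    Nanos,
}

/// Returns the parquet TIME unit for an arrow time unit, or `None` when the
/// unit cannot be represented and no logical type should be written.
fn time_logical_unit(unit: ArrowTimeUnit) -> Option<ParquetTimeUnit> {
    match unit {
        // Seconds cannot be expressed. Labelling the column as millis
        // without rescaling the stored values would misdescribe the data.
        ArrowTimeUnit::Second => None,
        ArrowTimeUnit::Millisecond => Some(ParquetTimeUnit::Millis),
        ArrowTimeUnit::Microsecond => Some(ParquetTimeUnit::Micros),
        ArrowTimeUnit::Nanosecond => Some(ParquetTimeUnit::Nanos),
    }
}

fn main() {
    for unit in [
        ArrowTimeUnit::Second,
        ArrowTimeUnit::Millisecond,
        ArrowTimeUnit::Microsecond,
        ArrowTimeUnit::Nanosecond,
    ] {
        println!("{:?} -> {:?}", unit, time_logical_unit(unit));
    }
}
```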

parquet/src/arrow/array_reader/builder.rs (outdated, resolved)

};

Ok(visitor.dispatch(parquet_type, context)?.unwrap())
}
Contributor

It would be awesome to add tests specifically for this logic that enumerate parquet types and their expected conversions to arrow.

Maybe that could be done as a follow-on PR.

Contributor Author

There is fairly good coverage of the type conversion already in schema.rs, but there is definitely scope for testing repetition levels in addition. Filed #1698


}
_ => Ok(DataType::FixedSizeBinary(type_length)),
}
}
Contributor

Also here, some explicit tests showing conversions, as a way to document expected behavior, would be really nice.

Contributor Author

These type conversions are very well covered by the unit tests in schema.rs

@tustvold
Contributor Author

I think this is now ready, if I've missed anything let me know. I think it is worth highlighting as a breaking change in the changelog, so that on the off chance it does break something someone was relying on, even if it likely was a bug, they know where to look and we can hopefully quickly unblock them.

https://xkcd.com/1172/

@tustvold
Contributor Author

Looking into test failures

@jhorstmann
Contributor

I didn't review this in detail, but did run our test suite against this branch and did not notice any issues.

.unwrap_err()
.to_string();

assert!(
Contributor

👍


alamb commented May 13, 2022

Merging (this is the first PR in what will be released as arrow 15.0.0) 🎉


alamb commented May 13, 2022

😅 glad I didn't try to include this in 14.0.0 -- see #1701

@tustvold
Contributor Author

@alamb that's unfortunately expected... DataFusion has a bug... Will provide context on ticket
