Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix reading nested lists from parquet files #1517

Merged
merged 2 commits into from
Apr 7, 2022
Merged

Conversation

viirya
Copy link
Member

@viirya viirya commented Apr 1, 2022

Which issue does this PR close?

Closes #1515.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the parquet Changes to the parquet crate label Apr 1, 2022
@viirya viirya mentioned this pull request Apr 1, 2022
Comment on lines +703 to +708
ArrowType::List(list_field) => match list_field.data_type() {
ArrowType::Struct(fields) => {
field = fields.iter().find(|f| f.name() == part)
}
_ => field = Some(list_field.as_ref()),
},
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was tested by using the example in #1515. Not sure if it is easy to add a unit test.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we put the example parquet file into parquet-testing too?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ideally we would construct a parquet file with this pattern programmatically (aka write it out in a test and then read it back in).

if that is not possible / feasible, adding an example in parquet-testing would be second best

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Let me try to make one. Thanks.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for driving this @viirya

Comment on lines +703 to +708
ArrowType::List(list_field) => match list_field.data_type() {
ArrowType::Struct(fields) => {
field = fields.iter().find(|f| f.name() == part)
}
_ => field = Some(list_field.as_ref()),
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ideally we would construct a parquet file with this pattern programmatically (aka write it out in a test and then read it back in).

if that is not possible / feasible, adding an example in parquet-testing would be second best

@alamb alamb changed the title Fix get_arrow_field for List Fix reading nested lists from parquet files Apr 4, 2022
@viirya
Copy link
Member Author

viirya commented Apr 5, 2022

Interesting, I cannot reproduce the Parquet file by writing a simple Parquet file in test.

The main difference is:

For the original Parquet file, the path to the name column is:

 ["table", "table_info", "name"]

Because table_info is a list of struct, so we need the change in this PR.

I use the same message type to write out a Parquet file in test. But when I read it back. For the same name column, the column path is different:

["arrow_schema", "table_info", "table_info", "name"]

For the first table_info, get_arrow_field will simply pick up same name field from arrow schema, which is the list field. Then moving to next table_info, it picks the struct field from there. Next part is name, and current field is struct, it can get the correct field.

I checked the arrow schema for the Parquet file and the one I wrote in test, they are the same. So I am currently not sure why the column paths for same column name could be different for two cases.

Update: The different column path should not be an issue. The message type from the issue and the one I get from the Parquet file using arrow-rs, are the the same. I think it is because of that. But both cases we should be able to read them.

@viirya
Copy link
Member Author

viirya commented Apr 5, 2022

Let me dig why there is the difference.

Comment on lines -324 to -327
let item_reader = self
.dispatch(item_type.clone(), &new_context)
.unwrap()
.unwrap();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When a column is not included, dispatch will return None. unwrap here will get unexpected error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I found using a whitespace blind diff made the changes easier to understand: https://github.com/apache/arrow-rs/pull/1517/files?w=1

Comment on lines +2280 to +2295
message table {
REPEATED group table_info {
REQUIRED BYTE_ARRAY name;
REPEATED group cols {
REQUIRED BYTE_ARRAY name;
REQUIRED INT32 type;
OPTIONAL INT32 length;
}
REPEATED group tags {
REQUIRED BYTE_ARRAY name;
REQUIRED INT32 type;
OPTIONAL INT32 length;
}
}
}
";
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This message type is copied from the issue.

@codecov-commenter
Copy link

Codecov Report

Merging #1517 (56d718f) into master (c5442cf) will increase coverage by 0.10%.
The diff coverage is 53.70%.

@@            Coverage Diff             @@
##           master    #1517      +/-   ##
==========================================
+ Coverage   82.68%   82.79%   +0.10%     
==========================================
  Files         188      190       +2     
  Lines       54361    54717     +356     
==========================================
+ Hits        44951    45301     +350     
- Misses       9410     9416       +6     
Impacted Files Coverage Δ
parquet/src/arrow/array_reader.rs 87.07% <ø> (+2.71%) ⬆️
parquet/src/arrow/array_reader/builder.rs 67.91% <53.70%> (+1.61%) ⬆️
arrow/src/array/equal/utils.rs 74.45% <0.00%> (-3.24%) ⬇️
arrow/src/array/transform/mod.rs 86.35% <0.00%> (-0.23%) ⬇️
arrow/src/array/builder.rs 86.68% <0.00%> (-0.05%) ⬇️
arrow/src/compute/kernels/filter.rs 88.00% <0.00%> (ø)
arrow/src/compute/kernels/length.rs 100.00% <0.00%> (ø)
arrow/src/array/equal/union.rs 100.00% <0.00%> (ø)
arrow/src/ffi_stream.rs 79.03% <0.00%> (ø)
arrow/src/buffer/mutable.rs 90.53% <0.00%> (+0.02%) ⬆️
... and 11 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c5442cf...56d718f. Read the comment docs.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thank you @viirya

@@ -2268,4 +2272,67 @@ mod tests {
&PrimitiveArray::<ArrowInt32>::from(vec![Some(3), Some(4)])
);
}

#[test]
fn test_nested_lists() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested running this test without the rest of the code change in this PR and I got:

---- arrow::array_reader::tests::test_nested_lists stdout ----
thread 'arrow::array_reader::tests::test_nested_lists' panicked at 'called `Option::unwrap()` on a `None` value', parquet/src/arrow/array_reader/builder.rs:327:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Comment on lines -324 to -327
let item_reader = self
.dispatch(item_type.clone(), &new_context)
.unwrap()
.unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 I found using a whitespace blind diff made the changes easier to understand: https://github.com/apache/arrow-rs/pull/1517/files?w=1

@alamb alamb merged commit f0f91e8 into apache:master Apr 7, 2022
@alamb
Copy link
Contributor

alamb commented Apr 7, 2022

Thanks @viirya

@viirya
Copy link
Member Author

viirya commented Apr 7, 2022

Thank you @alamb !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cannot read parquet file
3 participants