Add support for nested list arrays from parquet to arrow arrays (#993) #1588

Merged 11 commits on May 9, 2022

Contributor

@tustvold tustvold commented Apr 18, 2022

Which issue does this PR close?

Closes #993.

Closes #720

Rationale for this change

Adds support for reading nested lists from parquet.

What changes are included in this PR?

Reworks the ListArrayReader to handle nested repetition levels.

Some drive-by cleanup of ArrayReaderBuilder so that it returns an error on encountering non-spec-compliant data.
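
To make "nested repetition levels" concrete, here is a minimal sketch (not the PR's code) of how a `list<list<i32>>` column is flattened into repetition levels, definition levels and leaf values. It assumes every list and every item is required, so the maximum definition and repetition levels are both 2; nullable fields would add further definition levels. The function name `encode_nested` is hypothetical.

```rust
/// Encode records of type list<list<i32>> (all fields assumed required) into
/// Dremel-style repetition levels, definition levels and leaf values.
fn encode_nested(records: &[Vec<Vec<i32>>]) -> (Vec<i16>, Vec<i16>, Vec<i32>) {
    let (mut rep, mut def, mut values) = (Vec::new(), Vec::new(), Vec::new());
    for outer in records {
        if outer.is_empty() {
            // Empty outer list: a single slot at definition level 0.
            rep.push(0);
            def.push(0);
            continue;
        }
        for (i, inner) in outer.iter().enumerate() {
            // 0 starts a new record, 1 starts a new inner list in the same record.
            let start_rep: i16 = if i == 0 { 0 } else { 1 };
            if inner.is_empty() {
                // Empty inner list: a slot at definition level 1, no value.
                rep.push(start_rep);
                def.push(1);
                continue;
            }
            for (j, v) in inner.iter().enumerate() {
                // 2 continues the current inner list.
                rep.push(if j == 0 { start_rep } else { 2 });
                def.push(2);
                values.push(*v);
            }
        }
    }
    (rep, def, values)
}
```

For `[[[1, 2], [3]], [], [[], [4]]]` this yields repetition levels `[0, 2, 1, 0, 0, 1]`, definition levels `[2, 2, 2, 0, 1, 2]` and values `[1, 2, 3, 4]`. A nested list reader has to invert this mapping.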

Are there any user-facing changes?

No

@github-actions github-actions bot added the parquet Changes to the parquet crate label Apr 18, 2022
let key_reader = {
let mut key_context = new_context.clone();
key_context.def_level += 1;
key_context.path.append(vec![map_key.name().to_string()]);
Contributor Author

This is a drive-by fix: the context is the parent's context, not that of the value being dispatched. Appending the key name here would result in map_key appearing twice.


// If the list can contain nulls
Contributor Author

I recommend reading this with whitespace changes hidden: https://github.com/apache/arrow-rs/pull/1588/files?w=1

)
}

fn test_nested_list<OffsetSize: OffsetSizeTrait>() {
Contributor Author

I think this wins the prize for the most 🤯 test I've ever written

new_context.def_level += 1;
new_context.rep_level += 1;
false
return Err(ArrowError(format!(
Contributor Author

This just raises the error earlier; there is no point continuing if this is not supported anyway 😁

ArrowType::List(_)
| ArrowType::FixedSizeList(_, _)
| ArrowType::Dictionary(_, _) => Err(ArrowError(format!(
"reading List({:?}) into arrow not supported yet",
Contributor Author

This is the error that was previously returned for a nested list

codecov-commenter commented May 4, 2022

Codecov Report

Merging #1588 (15b8144) into master (8f24c45) will increase coverage by 0.13%.
The diff coverage is 82.16%.

@@            Coverage Diff             @@
##           master    #1588      +/-   ##
==========================================
+ Coverage   83.02%   83.16%   +0.13%     
==========================================
  Files         193      193              
  Lines       55612    56005     +393     
==========================================
+ Hits        46174    46574     +400     
+ Misses       9438     9431       -7     
Impacted Files Coverage Δ
parquet/src/schema/visitor.rs 68.00% <0.00%> (+0.89%) ⬆️
parquet/src/arrow/array_reader/builder.rs 68.97% <59.77%> (+0.96%) ⬆️
parquet/src/arrow/array_reader/test_util.rs 81.08% <80.00%> (-2.26%) ⬇️
parquet/src/arrow/array_reader/list_array.rs 93.28% <93.22%> (+5.31%) ⬆️
parquet/src/arrow/array_reader.rs 89.76% <100.00%> (-0.07%) ⬇️
arrow/src/array/array.rs 89.67% <0.00%> (-3.04%) ⬇️
parquet_derive/src/parquet_field.rs 65.98% <0.00%> (-0.23%) ⬇️
arrow/src/ipc/reader.rs 88.80% <0.00%> (-0.23%) ⬇️
parquet/src/encodings/encoding.rs 93.37% <0.00%> (-0.19%) ⬇️
arrow/src/compute/kernels/take.rs 95.27% <0.00%> (-0.14%) ⬇️
... and 24 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@tustvold tustvold changed the title from "WIP: Add support for nested list arrays (#993)" to "Add support for nested list arrays (#993)" May 4, 2022
@tustvold tustvold marked this pull request as ready for review May 4, 2022 16:35
let mut skipped = 0;

// Builder used to construct the filtered child data, skipping empty lists and nulls
let mut child_data_builder = MutableArrayData::new(
Contributor Author

I switched to using this over the take kernel as I think it makes it easier to follow what is going on. It should also be significantly faster where nulls and empty lists are rare.
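
As a rough illustration of why run-based copying helps when skipped slots are rare, the sketch below filters a plain `i32` slice by copying contiguous runs in bulk. It is an analogy only: the PR works on Arrow arrays through `MutableArrayData`, whose `extend(index, start, end)` call copies a range from a source array, whereas `filter_runs` and its `keep` mask are hypothetical names operating on standard-library types.

```rust
/// Copy the kept entries of `values` into a new vector, one contiguous run
/// at a time. Returns the filtered values and the number of bulk copies made.
fn filter_runs(values: &[i32], keep: &[bool]) -> (Vec<i32>, usize) {
    assert_eq!(values.len(), keep.len());
    let mut out = Vec::with_capacity(values.len());
    let mut copies = 0;
    // Start index of the run of kept values currently being tracked, if any.
    let mut run_start: Option<usize> = None;
    for (i, &k) in keep.iter().enumerate() {
        match (k, run_start) {
            // A kept value begins a new run.
            (true, None) => run_start = Some(i),
            // A skipped slot ends the run: flush it in one copy.
            (false, Some(start)) => {
                out.extend_from_slice(&values[start..i]);
                copies += 1;
                run_start = None;
            }
            _ => {}
        }
    }
    // Flush the trailing run, if the input ended on kept values.
    if let Some(start) = run_start {
        out.extend_from_slice(&values[start..]);
        copies += 1;
    }
    (out, copies)
}
```

With a single skipped slot in the middle of the input, the whole array is rebuilt with two bulk copies, rather than one index lookup per element as with a take-style gather.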

Contributor Author

tustvold commented May 4, 2022

I've confirmed this can read nested_schema.parquet from @andrei-ionescu on apache/datafusion#1383.

I've also confirmed this can read https://github.com/Igosuki/arrow2/blob/main/part-00000-b4749aa1-94e4-4ddb-bab2-954c4d3a290f.c000.snappy.parquet and https://github.com/chauhanVritul/sampleparquet/blob/main/part-00000-8e5acb24-eb4e-491c-8c85-88799f25d1f0-c000.snappy.parquet provided by @Igosuki on #720

I've manually checked that the output matches that of duckdb, and it looks good.

@andrei-ionescu

Thank you @tustvold

@alamb alamb changed the title from "Add support for nested list arrays (#993)" to "Add support for nested list arrays from parquet to arrow arrays (#993)" May 5, 2022
Contributor

@alamb alamb left a comment

I can't say I am a deep expert in this area or this part of the codebase, but I did give the PR a careful "software engineering" review, with careful study of the tests. LGTM 👍

FWIW, the whitespace-blind diff (https://github.com/apache/arrow-rs/pull/1588/files?w=1) helped in my review.

cc @nevi-me and @sunchao in case you would like to review

/// item type.
///
/// To fully understand this algorithm, please refer to
/// [parquet doc](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md).
///
/// For example, a standard list type looks like:
Contributor

❤️

Member

nice document addition 👍

.build()
.unwrap();

let offsets = to_offsets::<OffsetSize>(vec![0, 4, 4, 4, 5]);
Contributor

Would it be valuable to have a sublist that has more than one item?

It appears that this test only has null, [], or a single-element [x] list. I may be misunderstanding the intent of the test or what is possible in nested lists.

Contributor Author

There is a [1, null]; I can probably add more, though.

Contributor Author

Done in 05c2311
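
For readers less familiar with Arrow list offsets, such as the `[0, 4, 4, 4, 5]` in the snippet above: list `i` spans child indices `offsets[i]..offsets[i + 1]`, so the list lengths are the differences between neighbouring offsets. A minimal sketch, with `list_lengths` as a hypothetical helper:

```rust
/// Length of each list described by an Arrow-style offsets buffer.
fn list_lengths(offsets: &[i32]) -> Vec<i32> {
    // Each adjacent pair (start, end) delimits one list.
    offsets.windows(2).map(|w| w[1] - w[0]).collect()
}
```

Applied to `[0, 4, 4, 4, 5]` this gives lengths `[4, 0, 0, 1]`. A zero length alone does not distinguish a null list from an empty one; that is recorded separately in the validity bitmap.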

true,
);

let actual = l1.next_batch(1024).unwrap();
Contributor

Would it also make sense to test with l1.next_batch(2) (i.e. something that doesn't decode the entire array in one go)?

Contributor Author

Done in 05c2311; this required adding functionality to InMemoryArrayReader so that it actually respects the batch size.
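
Respecting a batch size for repeated data means cutting the level buffers on record boundaries, not after a fixed number of level entries. The sketch below shows one way to find that cut point, assuming a repetition level of 0 marks the start of a new top-level record. `levels_in_batch` is a hypothetical helper and not necessarily how InMemoryArrayReader does it.

```rust
/// Number of leading level entries that cover at most `batch_size` records,
/// where a repetition level of 0 marks the start of a new record.
fn levels_in_batch(rep_levels: &[i16], batch_size: usize) -> usize {
    let mut records = 0;
    for (i, &r) in rep_levels.iter().enumerate() {
        if r == 0 {
            if records == batch_size {
                // Entry `i` starts a record beyond the batch: stop before it.
                return i;
            }
            records += 1;
        }
    }
    // Fewer than or exactly `batch_size` records remain: consume everything.
    rep_levels.len()
}
```

Calling this repeatedly on the remaining levels yields successive batches, each ending on a record boundary.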

@@ -1532,15 +1532,15 @@ mod tests {
 ArrowType::Int32,
 array_1.clone(),
 Some(vec![0, 1, 2, 3, 1]),
-Some(vec![1, 1, 1, 1, 1]),
+Some(vec![0, 1, 1, 1, 1]),
Contributor Author

Drive-by fix: this set of levels is impossible, as the first rep level must be 0. Nothing cares, since this test doesn't actually decode the implied list, but 🤷
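
The invariant mentioned in this comment is cheap to check. A minimal sketch of such a check follows; `check_rep_levels` and its error messages are hypothetical and not part of the parquet crate.

```rust
/// Reject repetition levels that cannot come from a valid column:
/// the first entry must start a record (level 0) and no level may
/// be negative or exceed `max_rep_level`.
fn check_rep_levels(rep_levels: &[i16], max_rep_level: i16) -> Result<(), String> {
    if let Some(&first) = rep_levels.first() {
        if first != 0 {
            return Err(format!("first repetition level must be 0, got {}", first));
        }
    }
    for (i, &r) in rep_levels.iter().enumerate() {
        if r < 0 || r > max_rep_level {
            return Err(format!("repetition level {} at index {} is out of range", r, i));
        }
    }
    Ok(())
}
```

The old test input `[1, 1, 1, 1, 1]` fails this check, while the corrected `[0, 1, 1, 1, 1]` passes for a maximum repetition level of 1.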

Member

@viirya viirya left a comment

Looked at the changes, especially those to ListArrayReader. I'm not a Parquet expert either, but the change looks reasonable when read alongside the added comments. 👍

Comment on lines +38 to +43
/// The definition level at which this list is not null
def_level: i16,
/// The repetition level that corresponds to a new value in this array
rep_level: i16,
/// If this list is nullable
nullable: bool,
Member

The comment looks clear. 👍
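
One way to see how these three fields could drive decoding is the sketch below, which derives list offsets and validity for a single list level. It is an illustration under stated assumptions, not the PR's implementation: the `ListLevels` struct and `decode_list` function are hypothetical, and the convention assumed here is that a definition level below `def_level` means a null list, equal to `def_level` means an empty list, and above it means an element slot. The exact conventions in ListArrayReader may differ.

```rust
use std::cmp::Ordering;

/// Hypothetical mirror of the three fields discussed above.
struct ListLevels {
    def_level: i16,
    rep_level: i16,
    nullable: bool,
}

/// Derive (offsets, validity) for one list level from level buffers.
fn decode_list(rep: &[i16], def: &[i16], lv: &ListLevels) -> (Vec<i32>, Vec<bool>) {
    let mut offsets = Vec::new();
    let mut validity = Vec::new();
    let mut cur = 0i32; // number of child slots seen so far
    for (r, d) in rep.iter().zip(def) {
        match r.cmp(&lv.rep_level) {
            // A lower repetition level starts a new list at this level.
            Ordering::Less => {
                offsets.push(cur);
                if *d > lv.def_level {
                    // The list has a first element.
                    cur += 1;
                    validity.push(true);
                } else {
                    // Empty list (d == def_level) or, if nullable, null.
                    validity.push(*d == lv.def_level || !lv.nullable);
                }
            }
            // Same repetition level: another element of the current list.
            Ordering::Equal => cur += 1,
            // Higher repetition level: belongs to a nested child list.
            Ordering::Greater => {}
        }
    }
    offsets.push(cur);
    (offsets, validity)
}
```

For an optional list of required integers (`def_level: 1`, `rep_level: 1`, `nullable: true`), the rows `[1, 2, 3]`, null, `[]`, `[4]` have repetition levels `[0, 1, 1, 0, 0, 0]` and definition levels `[2, 2, 2, 0, 1, 2]`, which decode to offsets `[0, 3, 3, 3, 4]` and validity `[true, false, true, true]`.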

@tustvold tustvold merged commit edd96c0 into apache:master May 9, 2022
Successfully merging this pull request may close these issues.

Support Reading Nested List Arrays
Support for nested data types
5 participants