Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Read/write nested dictionary in ipc stream reader/writer #1566

Merged
merged 3 commits into from
Apr 15, 2022

Conversation

viirya
Copy link
Member

@viirya viirya commented Apr 13, 2022

Which issue does this PR close?

Closes #1565.

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

@github-actions github-actions bot added the arrow Changes to the arrow crate label Apr 13, 2022
@codecov-commenter
Copy link

Codecov Report

Merging #1566 (22f3908) into master (d20ebc0) will decrease coverage by 0.01%.
The diff coverage is 74.00%.

@@            Coverage Diff             @@
##           master    #1566      +/-   ##
==========================================
- Coverage   82.84%   82.83%   -0.02%     
==========================================
  Files         190      190              
  Lines       54985    55031      +46     
==========================================
+ Hits        45552    45584      +32     
- Misses       9433     9447      +14     
Impacted Files Coverage Δ
arrow/src/ipc/writer.rs 81.80% <52.00%> (-1.35%) ⬇️
arrow/src/datatypes/field.rs 54.62% <85.71%> (+0.52%) ⬆️
arrow/src/ipc/reader.rs 88.61% <100.00%> (+0.32%) ⬆️
arrow/src/datatypes/datatype.rs 66.40% <0.00%> (-0.40%) ⬇️
parquet_derive/src/parquet_field.rs 66.21% <0.00%> (-0.23%) ⬇️
parquet/src/encodings/encoding.rs 93.37% <0.00%> (-0.19%) ⬇️
arrow/src/array/transform/mod.rs 86.46% <0.00%> (+0.11%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d20ebc0...22f3908. Read the comment docs.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look good to me -- thank you @viirya

DataType::Struct(fields) | DataType::Union(fields, _) => {
collected_fields.extend(fields.iter().flat_map(|f| f.fields()))
}
DataType::List(field)
| DataType::LargeList(field)
| DataType::FixedSizeList(field, _)
| DataType::Map(field, _) => collected_fields.push(field),
DataType::Dictionary(_, value_field) => {
collected_fields.append(&mut self._fields(value_field.as_ref()))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it is worth adding an error for types that aren't explicitly supported (I realize this will likely need a bunch of plumbing to return Result)

like

_ => return Err("Type not supported")

DataType::Struct(fields) | DataType::Union(fields, _) => {
collected_fields.extend(fields.iter().flat_map(|f| f.fields()))
}
DataType::List(field)
| DataType::LargeList(field)
| DataType::FixedSizeList(field, _)
| DataType::Map(field, _) => collected_fields.push(field),
DataType::Dictionary(_, value_field) => {
collected_fields.append(&mut self._fields(value_field.as_ref()))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the key improvement of the PR is that this now recurses into children, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, this is key part here to get fields inside dictionary type.

@@ -1439,4 +1440,42 @@ mod tests {
let output_batch = roundtrip_ipc_stream(&input_batch);
assert_eq!(input_batch, output_batch);
}

#[test]
fn test_roundtrip_stream_nested_dict_dict() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried running this test without the code changes in this PR and it failed thusly. 👍

---- ipc::reader::tests::test_roundtrip_stream_nested_dict_dict stdout ----
thread 'ipc::reader::tests::test_roundtrip_stream_nested_dict_dict' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidArgumentError("dictionary id not found in schema")', arrow/src/ipc/reader.rs:1331:32

when I removed the field code

and like this when I removed the writer support

---- ipc::reader::tests::test_roundtrip_stream_nested_dict_dict stdout ----
thread 'ipc::reader::tests::test_roundtrip_stream_nested_dict_dict' panicked at 'called `Option::unwrap()` on a `None` value', arrow/src/ipc/reader.rs:176:64

@@ -175,13 +186,37 @@ impl IpcDataGenerator {
)?;
}
}
_ => (),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this should return an error for unsupported nested types

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I guess that it doesn't just returning an error, is because if no dictionaries exist inside the fields of these nested types, then the ipc writer still can work. To return an error for dictionaries existing inside unsupported nested types, we need to look into them as supported types, so it's basically almost near to implement them.

I may take some time continuing the work to add more supports (map, etc) here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Read/write nested dictionary in ipc stream reader/writer
3 participants