ARROW-10766: [Rust] [Parquet] Compute nested list definitions #9240
Hi everyone interested in the Parquet writer. This PR effectively gives us the ability to compute how to write arbitrarily nested types. It has the side effect that nested lists can also be written. I'll open JIRAs as I go along. For reviewers, please note: this has taken me a few months of weekends to get right. I've iterated over various solutions to arrive here. I've spent far too long on this, so I practically don't have any fresh eyes here. I worked on all the edge cases that I could think of with lists and structs. I've documented them, but I'll review the doc comments and add more detail where I still feel that it's lacking. Thank you ❤️
Codecov Report
@@ Coverage Diff @@
## master #9240 +/- ##
==========================================
+ Coverage 81.61% 81.75% +0.14%
==========================================
Files 215 215
Lines 51867 52400 +533
==========================================
+ Hits 42329 42839 +510
- Misses 9538 9561 +23
#[test]
fn test_filter_array_indices() {
    let level = LevelInfo {
I didn't have enough test cases to compute here. I'll add more when I increase coverage of deeply nested lists.
This mainly computes definition and repetition levels for lists. It also partially adds deeply nested write support. I am, however, going to complete this in a separate PR. This has really been challenging because we can't roundtrip without nested writers, so it's taken me months to complete. In the process, I've had to rely on using Spark to verify my work. This PR is also not optimised. I've left TODOs in a few places (sparingly). The biggest next step is to remove array_mask: Vec<u8> and replace it with a bitpacked vector to save memory.
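For readers unfamiliar with the terms, here is a minimal sketch of how definition and repetition levels can be derived for a single nullable list of nullable integers. It is not the PR's implementation: `list_levels` is a hypothetical helper, and the level numbering assumes the standard three-level Parquet list layout (optional list, repeated group, optional element), giving a maximum definition level of 3 and a maximum repetition level of 1.

```rust
/// Hypothetical helper (not part of the PR): computes definition levels,
/// repetition levels and the non-null leaf values for one list column.
///
/// Assumed level meanings:
///   def 0 = list is null, def 1 = list is empty,
///   def 2 = element is null, def 3 = element is present;
///   rep 0 = first entry of a row, rep 1 = continuation of the same list.
fn list_levels(rows: &[Option<Vec<Option<i32>>>]) -> (Vec<i16>, Vec<i16>, Vec<i32>) {
    let mut def_levels = Vec::new();
    let mut rep_levels = Vec::new();
    let mut values = Vec::new();

    for row in rows {
        match row {
            // A null list still occupies one slot in both level vectors.
            None => {
                def_levels.push(0);
                rep_levels.push(0);
            }
            // An empty list is defined one level deeper than a null list.
            Some(items) if items.is_empty() => {
                def_levels.push(1);
                rep_levels.push(0);
            }
            Some(items) => {
                for (i, item) in items.iter().enumerate() {
                    // Only the first element starts a new row.
                    rep_levels.push(if i == 0 { 0 } else { 1 });
                    match item {
                        Some(v) => {
                            def_levels.push(3);
                            values.push(*v);
                        }
                        None => def_levels.push(2),
                    }
                }
            }
        }
    }
    (def_levels, rep_levels, values)
}

fn main() {
    let rows: Vec<Option<Vec<Option<i32>>>> = vec![
        Some(vec![Some(1), None, Some(3)]),
        None,
        Some(vec![]),
        Some(vec![Some(4)]),
    ];
    let (def, rep, values) = list_levels(&rows);
    println!("def = {:?}", def);
    println!("rep = {:?}", rep);
    println!("values = {:?}", values);
}
```

Deeper nesting repeats the same idea once per list or struct level, which is where most of the edge cases come from.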
75487cf to a59613b
@sunchao could you please review this when you get a chance?
Sure. Thanks for pinging me. Will take a look soon.
| ArrowType::LargeList(field) => field,
_ => {
    // Panic: this is safe as we only write lists from list datatypes
    unreachable!()
For my own curiosity, is there a specific reason to use unreachable! rather than panic! in cases like this? I understand the outcome will be the same.
They don't make any difference; unreachable!() will call panic!() with a message about unreachable code being reached. So it's probably a more descriptive panic.
I thought that marking a condition as unreachable!() lets the compiler optimise out that condition, but it seems like only its unsafe equivalent does.
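To illustrate the point, a small sketch: `unreachable!()` is an ordinary panic that can be observed at runtime, so the compiler has to keep the branch. The `Kind` enum and `child_name` function below are made up for this example and are not types from the PR. The unsafe equivalent referred to above is presumably `std::hint::unreachable_unchecked`, which is only mentioned in a comment here because reaching it is undefined behaviour.

```rust
use std::panic;

// Hypothetical stand-in for a data type enum; not the Arrow `DataType`.
enum Kind {
    List,
    LargeList,
    Int32,
}

// Hypothetical function that is only meant to be called with list kinds.
fn child_name(kind: &Kind) -> &'static str {
    match kind {
        Kind::List | Kind::LargeList => "item",
        // Safe macro: expands to a panic with an "entered unreachable code"
        // message. The branch still exists in the compiled program.
        // `unsafe { std::hint::unreachable_unchecked() }` would instead let
        // the optimiser assume this branch never runs.
        _ => unreachable!("only list kinds are passed here"),
    }
}

fn main() {
    assert_eq!(child_name(&Kind::List), "item");
    assert_eq!(child_name(&Kind::LargeList), "item");

    // Reaching the branch is a regular panic that can be caught.
    let result = panic::catch_unwind(|| child_name(&Kind::Int32));
    assert!(result.is_err());
}
```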
While I did not grasp all the details here, I think that this is ready to merge.
Impressive work, @nevi-me 💯
@alamb given that this is a 1k+ line PR, could you give us a chance to review it properly before eagerly merging it?
I am really sorry @sunchao -- I missed your earlier comment that you would be reviewing this more carefully. I have been trying to clear out the backlog of Rust PRs and I agree I was too eager on this one. Would you like me to revert this PR and prepare a new one to re-merge?
No worries @alamb. I'll review this closed PR and we can address any feedback in follow-ups.
😓 Thank you for understanding @sunchao |
This mainly computes definition and repetition levels for lists. It also partially adds deeply nested write support. I am, however, going to complete this in a separate PR. This has really been challenging because we can't roundtrip without nested writers, so it's taken me months to complete. In the process, I've had to rely on using Spark to verify my work. This PR is also not optimised. I've left TODOs in a few places (sparingly). The biggest next step is to remove array_mask: Vec<u8> and replace it with a bitpacked vector to save memory.
Closes #9240 from nevi-me/ARROW-10766-v2
Authored-by: Neville Dipale <nevilledips@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
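As a rough idea of what the bitpacked replacement for array_mask: Vec<u8> could look like, here is a sketch that stores one bit per entry instead of one byte. `BitMask` is a hypothetical type written for this note; the follow-up work may well use an existing Arrow buffer type instead.

```rust
/// Hypothetical bit-packed mask: one bit per entry, least significant bit first.
struct BitMask {
    bytes: Vec<u8>,
    len: usize,
}

impl BitMask {
    fn new() -> Self {
        BitMask { bytes: Vec::new(), len: 0 }
    }

    /// Appends one flag, growing the backing storage every 8 entries.
    fn push(&mut self, set: bool) {
        let bit = self.len % 8;
        if bit == 0 {
            self.bytes.push(0);
        }
        if set {
            let last = self.bytes.len() - 1;
            self.bytes[last] |= 1u8 << bit;
        }
        self.len += 1;
    }

    /// Reads the flag at position `i`; panics if `i` is out of bounds.
    fn get(&self, i: usize) -> bool {
        assert!(i < self.len, "index out of bounds");
        (self.bytes[i / 8] >> (i % 8)) & 1 == 1
    }
}

fn main() {
    let flags = [true, false, true, true, false, false, true, false, true, true];

    let mut mask = BitMask::new();
    for &f in &flags {
        mask.push(f);
    }

    // Ten flags fit in two bytes, versus ten bytes in a Vec<u8> mask.
    assert_eq!(mask.bytes.len(), 2);
    for (i, &f) in flags.iter().enumerate() {
        assert_eq!(mask.get(i), f);
    }
}
```

The trade-off is roughly an 8x reduction in memory for the mask in exchange for a shift and a bitwise AND on every read.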