Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-10766: [Rust] [Parquet] Compute nested list definitions #9240

Closed
wants to merge 2 commits into from

Conversation

nevi-me
Copy link
Contributor

@nevi-me nevi-me commented Jan 18, 2021

This mainly computes definition and repetition leves for lists.
It also partially adds deeply nested write support.
I am however going to complete this in a separate PR.

This has really been challenging because we can't roundtrip without nested writers,
so it's taken me months to complete.
In the process, I've had to rely on using Spark to verify my work.

This PR is also not optimised. I've left TODOs in a few places (sparingly).
The biggest next step is to remove array_mask: Vec and replace it with a bitpacked vector to save memory.

@nevi-me nevi-me requested a review from sunchao January 18, 2021 00:53
@github-actions
Copy link

@nevi-me
Copy link
Contributor Author

nevi-me commented Jan 18, 2021

Hi everyone interested in the Parquet writer.

This PR effectively gives us the ability to compute how to write arbitrarily nested types. It has the side effect that nested lists can also be written.
There's a few places where I need to tidy up, but they're dependent on the Arrow reader (ARROW-10391), which unfortunately might be a lot of work on its own. I'm a bit worried that I might have to rework a fair share of the writer to handle nesting correctly. I've already seen instances where we don't always have enough information to arrive at the correct solution.

I'll open JIRAs as I go along.

For reviewers, please note:

This has taken me a few months on weekends to get right. I've iterated over various solutions to arrive here.
The implementation is not optimal (I haven't benchmarked the latest impl), but I'm confident that it's correct.
the extensive tests on the levels.rs will allow us to refactor with some confidence.

I've spent far too long on this, so I practically don't have any fresh eyes here. I worked on all the edge-cases that I could think with lists and structs. I've documented them, but I'll review the doc comments and add more detail where I still feel that it's lacking.

Thank you ❤️

@codecov-io
Copy link

Codecov Report

Merging #9240 (75487cf) into master (1393188) will increase coverage by 0.14%.
The diff coverage is 85.38%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9240      +/-   ##
==========================================
+ Coverage   81.61%   81.75%   +0.14%     
==========================================
  Files         215      215              
  Lines       51867    52400     +533     
==========================================
+ Hits        42329    42839     +510     
- Misses       9538     9561      +23     
Impacted Files Coverage Δ
rust/parquet/src/arrow/array_reader.rs 75.00% <71.42%> (+3.59%) ⬆️
rust/parquet/src/arrow/levels.rs 81.55% <84.69%> (+9.71%) ⬆️
rust/parquet/src/arrow/arrow_writer.rs 96.78% <100.00%> (+1.20%) ⬆️
rust/parquet/src/arrow/schema.rs 91.67% <100.00%> (+0.17%) ⬆️
rust/arrow/src/ipc/gen/Schema.rs 40.43% <0.00%> (+0.71%) ⬆️
rust/arrow/src/array/equal/utils.rs 74.75% <0.00%> (+0.97%) ⬆️
rust/arrow/src/ipc/convert.rs 93.67% <0.00%> (+1.58%) ⬆️
rust/arrow/src/array/transform/fixed_binary.rs 84.21% <0.00%> (+5.26%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 71b6b9c...75487cf. Read the comment docs.

rust/parquet/src/arrow/levels.rs Outdated Show resolved Hide resolved
rust/parquet/src/arrow/levels.rs Outdated Show resolved Hide resolved
rust/parquet/src/arrow/levels.rs Outdated Show resolved Hide resolved
rust/parquet/src/arrow/levels.rs Show resolved Hide resolved

#[test]
fn test_filter_array_indices() {
let level = LevelInfo {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't have enough test cases to compute here. I'll add more when I increase coverage of deeply-nested lists

This mainly computes definition and repetition leves for lists.
It also partially adds deeply nested write support.
I am however going to complete this in a separate PR.

This has really been challenging because we can't roundtrip without nested writers,
so it's taken me months to complete.
In the process, I've had to rely on using Spark to verify my work.

This PR is also not optimised. I've left TODOs in a few places (sparingly).
The biggest next step is to remove array_mask: Vec<u8> and replace it with a bitpacked vector to save memory.
@nevi-me
Copy link
Contributor Author

nevi-me commented Jan 20, 2021

@sunchao may you please kindly review this when you get a chance

@sunchao
Copy link
Member

sunchao commented Jan 20, 2021

Sure. Thanks for pinging me. Will take a look soon.

| ArrowType::LargeList(field) => field,
_ => {
// Panic: this is safe as we only write lists from list datatypes
unreachable!()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my own curiosity, is there a specific reason to use unreachable! rather than panic! in cases like this? I understand the outcome will be the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They don't make any difference, unreachable!() will call panic!() with a message about unreachable code being reached. So it's probably a more descriptive panic.

I tohught that marking a condition as unreachable!() lets the compiler optimise out that condition, but it seems like only its unsafe equivalent does.

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I did not grasp all the details here, I think that this is ready to merge.

Impressive work, @nevi-me 💯

@alamb alamb closed this in 251ecac Jan 22, 2021
@sunchao
Copy link
Member

sunchao commented Jan 22, 2021

@alamb given that this is a 1k+ line PR, could you give us a chance to review it properly before eagerly merging it?

@alamb
Copy link
Contributor

alamb commented Jan 22, 2021

I am really sorry @sunchao -- I missed your earlier comment that you would be reviewing this more carefully. I have been trying to clear out the backlog of Rust PRs and I agree I was too eager on this one

Would you like me to revert this PR and prepare a new one to re-merge?

@sunchao
Copy link
Member

sunchao commented Jan 22, 2021

No worries @alamb . I'll do review on this closed PR and we can address any feedback in followups.

@alamb
Copy link
Contributor

alamb commented Jan 22, 2021

😓 Thank you for understanding @sunchao

kszucs pushed a commit that referenced this pull request Jan 25, 2021
This mainly computes definition and repetition leves for lists.
It also partially adds deeply nested write support.
I am however going to complete this in a separate PR.

This has really been challenging because we can't roundtrip without nested writers,
so it's taken me months to complete.
In the process, I've had to rely on using Spark to verify my work.

This PR is also not optimised. I've left TODOs in a few places (sparingly).
The biggest next step is to remove array_mask: Vec<u8> and replace it with a bitpacked vector to save memory.

Closes #9240 from nevi-me/ARROW-10766-v2

Authored-by: Neville Dipale <nevilledips@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
This mainly computes definition and repetition leves for lists.
It also partially adds deeply nested write support.
I am however going to complete this in a separate PR.

This has really been challenging because we can't roundtrip without nested writers,
so it's taken me months to complete.
In the process, I've had to rely on using Spark to verify my work.

This PR is also not optimised. I've left TODOs in a few places (sparingly).
The biggest next step is to remove array_mask: Vec<u8> and replace it with a bitpacked vector to save memory.

Closes apache#9240 from nevi-me/ARROW-10766-v2

Authored-by: Neville Dipale <nevilledips@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
This mainly computes definition and repetition leves for lists.
It also partially adds deeply nested write support.
I am however going to complete this in a separate PR.

This has really been challenging because we can't roundtrip without nested writers,
so it's taken me months to complete.
In the process, I've had to rely on using Spark to verify my work.

This PR is also not optimised. I've left TODOs in a few places (sparingly).
The biggest next step is to remove array_mask: Vec<u8> and replace it with a bitpacked vector to save memory.

Closes apache#9240 from nevi-me/ARROW-10766-v2

Authored-by: Neville Dipale <nevilledips@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants