Fix null struct and list roundtrip #270
@alamb @jorgecarleitao this is a big one for me, as I've been able to verify that the writer does what it's supposed to do with mixed nested types.
Codecov Report
@@            Coverage Diff             @@
##           master     #270      +/-   ##
==========================================
+ Coverage   82.53%   82.56%   +0.02%
==========================================
  Files         162      162
  Lines       43796    43822      +26
==========================================
+ Hits        36149    36180      +31
+ Misses       7647     7642       -5
Hi @nevi-me, thank you. I probably will not have a chance to review this fully until tomorrow.
What an amazing PR, @nevi-me!
I was unable to follow the smaller details, but overall I can understand what was done and it looks like a great improvement to me.
I agree that it is necessary to have list_empty_def_level and list_null_def_level, and I also agree that LevelType makes a lot of sense.
Again, thanks a lot and amazing work 💯
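For readers unfamiliar with why a null list and an empty list need separate definition levels, here is a minimal sketch. It assumes a nullable list column with nullable int32 items. The constant names echo list_null_def_level and list_empty_def_level from the comment above, but the function `list_levels` and the concrete level values are illustrative assumptions, not the code in this PR.

```rust
/// Hypothetical helper (not from the PR): flattens a nullable list of
/// nullable i32 values into Parquet-style definition levels, repetition
/// levels, and the non-null leaf values.
fn list_levels(rows: &[Option<Vec<Option<i32>>>]) -> (Vec<i16>, Vec<i16>, Vec<i32>) {
    // Assumed levels for: optional list -> repeated group -> optional item.
    const LIST_NULL_DEF_LEVEL: i16 = 0; // the list itself is null
    const LIST_EMPTY_DEF_LEVEL: i16 = 1; // the list is valid but has no items
    const ITEM_NULL_DEF_LEVEL: i16 = 2; // a slot exists but the item is null
    const ITEM_VALID_DEF_LEVEL: i16 = 3; // the item is present

    let mut def = Vec::new();
    let mut rep = Vec::new();
    let mut values = Vec::new();

    for row in rows {
        match row {
            // A null list and an empty list each occupy one level slot,
            // and only the definition level tells them apart.
            None => {
                def.push(LIST_NULL_DEF_LEVEL);
                rep.push(0);
            }
            Some(items) if items.is_empty() => {
                def.push(LIST_EMPTY_DEF_LEVEL);
                rep.push(0);
            }
            Some(items) => {
                for (i, item) in items.iter().enumerate() {
                    // Repetition level 0 starts a new row, 1 continues the list.
                    rep.push(if i == 0 { 0 } else { 1 });
                    match item {
                        Some(v) => {
                            def.push(ITEM_VALID_DEF_LEVEL);
                            values.push(*v);
                        }
                        None => def.push(ITEM_NULL_DEF_LEVEL),
                    }
                }
            }
        }
    }
    (def, rep, values)
}

fn main() {
    // Rows: [1, null], null, [], [2]
    let rows = vec![Some(vec![Some(1), None]), None, Some(vec![]), Some(vec![Some(2)])];
    let (def, rep, values) = list_levels(&rows);
    println!("def = {:?}, rep = {:?}, values = {:?}", def, rep, values);
}
```

If both cases were collapsed into a single level, a reader could not tell the second row (null) from the third row (empty), which is the kind of roundtrip mismatch the PR title refers to.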
Like @jorgecarleitao I would say I did not follow all the details of this PR, but I did review all the test changes carefully and they look good to me. I think we can merge this
Nice work @nevi-me
@jorgecarleitao @alamb there are still errors with the more complex cases like
I discovered the issue by chance when trying to benchmark the writer with deeply nested lists and structs.
f779e23 to a9fb246
I've opened #282 to track the remaining issue.
Great job @nevi-me
Which issue does this PR close?
Closes #245.
Rationale for this change
This addresses bugs in the Rust Parquet writer and reader, where we were:
What changes are included in this PR?
This PR:
LevelType enum, that has enough information about the Arrow types when calculating levels. This is a lighter solution than passing Arrow fields around when computing levels, and could allow us to reuse the levels logic elsewhere in the codebase.
In working on this PR, I:
pyarrow, and wrote it to disk with pyarrow.parquet
pyarrow.parquet, and confirmed that the results were identical
pyspark, and confirmed that the results were identical
An interesting observation is that pyspark always interpreted the parquet columns as all nullable.
Are there any user-facing changes?
All changes are within crate-level structs
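
As a rough illustration of how a lightweight LevelType-style enum can carry enough type information to compute levels without passing Arrow fields around, consider the sketch below. The variants, the `bool` nullability flag, and the helper names `max_def_level` and `max_rep_level` are assumptions made for this example and may not match the definition in the PR.

```rust
/// Hypothetical stand-in for the PR's crate-level enum: each variant records
/// only the kind of node and whether it is nullable.
#[derive(Debug, Clone, Copy, PartialEq)]
enum LevelType {
    Root,
    List(bool),
    Struct(bool),
    Primitive(bool),
}

/// Maximum definition level for a leaf reached through `path`
/// (root first, leaf last), following the usual Parquet rules.
fn max_def_level(path: &[LevelType]) -> i16 {
    path.iter()
        .map(|t| match t {
            LevelType::Root => 0,
            // A list adds one level for its repeated group,
            // plus one more if the list itself is nullable.
            LevelType::List(nullable) => 1 + *nullable as i16,
            // Structs and primitives add a level only when nullable.
            LevelType::Struct(nullable) | LevelType::Primitive(nullable) => *nullable as i16,
        })
        .sum()
}

/// Maximum repetition level: one per list on the path.
fn max_rep_level(path: &[LevelType]) -> i16 {
    path.iter()
        .filter(|t| matches!(t, LevelType::List(_)))
        .count() as i16
}

fn main() {
    // A nullable struct containing a nullable list of nullable integers.
    let path = [
        LevelType::Root,
        LevelType::Struct(true),
        LevelType::List(true),
        LevelType::Primitive(true),
    ];
    println!(
        "max def level = {}, max rep level = {}",
        max_def_level(&path),
        max_rep_level(&path)
    );
}
```

Because the enum is small and `Copy`, a path like this can be built while walking a schema and reused wherever levels are needed, which is the reuse benefit the description hints at.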