Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix null struct and list roundtrip #270

Merged
merged 3 commits into from
May 11, 2021

Conversation

nevi-me
Copy link
Contributor

@nevi-me nevi-me commented May 8, 2021

Which issue does this PR close?

Closes #245 .

Rationale for this change

This addresses bugs in the Rust Parquet writer and reader, where we were:

  • Not reading lists correctly when they have null bitmaps vs not. We were creating null bitmaps even when a list was non-null.
  • Not incrementing definitions correctly for some combinations of lists and structs

What changes are included in this PR?

This PR:

  • Fixes the reader, by making roundtrip tests pass under conditions that were previously failing (mostly when lists are set as non-nullable).
  • Combines a few loose variables into a LevelType enum, that has enough information about the Arrow types when calculating levels. This is a lighter solution that passing Arrow fields around when computing levels, and could allow us to reuse the levels logic elsewhere in the codebase.
  • Enables nullability conditions that were failing

In working on this PR, I:

  • Wrote the rest recordbatch to an IPC file
  • Wrote the test recordbatch to parquet
  • Read the file with pyarrow, and wrote it to disk with pyarrow.parquet
  • Read both parquet files with pyarrow.parquet, and confirmed that the results were identical
  • Read both parquet files with pyspark, and confirmed that the results were identical

An interesting observation is that pyspark always interpreted the parquet columns as all nullable.

Are there any user-facing changes?

All changes are within crate-level structs

@nevi-me
Copy link
Contributor Author

nevi-me commented May 8, 2021

@alamb @jorgecarleitao this is a big one for me, as I've been able to verify that the writer does what it's supposed to do with mixed nested types.
I think the LevelInfo logic is now solid, and I should be able to start optimising it, like removing Vec<bool> and replacing it with bitmaps where it makes sense.

@codecov-commenter
Copy link

Codecov Report

Merging #270 (f779e23) into master (8bd769b) will increase coverage by 0.02%.
The diff coverage is 87.67%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #270      +/-   ##
==========================================
+ Coverage   82.53%   82.56%   +0.02%     
==========================================
  Files         162      162              
  Lines       43796    43822      +26     
==========================================
+ Hits        36149    36180      +31     
+ Misses       7647     7642       -5     
Impacted Files Coverage Δ
parquet/src/arrow/array_reader.rs 77.12% <80.95%> (-0.02%) ⬇️
parquet/src/arrow/levels.rs 82.59% <87.15%> (+1.33%) ⬆️
parquet/src/arrow/arrow_writer.rs 98.45% <100.00%> (+0.02%) ⬆️
parquet/src/encodings/encoding.rs 94.86% <0.00%> (-0.20%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8bd769b...f779e23. Read the comment docs.

@alamb
Copy link
Contributor

alamb commented May 9, 2021

Hi @nevi-me -- thank you. I will probably will not have a chance to review this fully until tomorrow

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what an amazing PR, @nevi-me !

I was unable to follow smaller details, but overall I can understand what was done and it looks a great improvement to me.

I agree that it is necessary to have list_empty_def_level and list_null_def_level, and I also agree that LevelType makes a lot of sense.

Again thanks a lot and amazing work 💯

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like @jorgecarleitao I would say I did not follow all the details of this PR, but I did review all the test changes carefully and they look good to me. I think we can merge this

Nice work @nevi-me

parquet/src/arrow/levels.rs Outdated Show resolved Hide resolved
@nevi-me
Copy link
Contributor Author

nevi-me commented May 10, 2021

@jorgecarleitao @alamb there's still errors with the more complex cases like <struct<list<list<struct<list<primitive>>>>>>, where one of the items is 0 length. I'm going to work on that case this week.

I discovered the issue by chance when trying to benchmark the writer with deeply nested lists and structs. pyarrow manages to write this correctly (wrote an IPC file from Rust, then read it in pyarrow and wrote that to parquet)

@nevi-me
Copy link
Contributor Author

nevi-me commented May 11, 2021

I've opened #282 to track the remaining issue

@nevi-me nevi-me merged commit 8226219 into apache:master May 11, 2021
@alamb
Copy link
Contributor

alamb commented May 11, 2021

Great job @nevi-me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix compatibility quirks between arrow and parquet structs
4 participants