Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug while writing parquet with empty lists of structs #1166

Merged
merged 1 commit into from
Jan 14, 2022

Conversation

helgikrs
Copy link
Contributor

Which issue does this PR close?

Closes #703

What changes are included in this PR?

Fix a bug in the definition level calculation for fields nested within a struct and a list. When a list is empty or null in parquet the nested field gets a null value. However, in arrow, the value is simply missing. When serializing an immediate child of the list, the list offsets are used to calculate the correct definition level for its children, but it is not carried further to fields nested deeper (e.g., fields on a struct within a list). This (somewhat hacky) fix treats a struct within a list as if it were a list.

Are there any user-facing changes?

No.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jan 12, 2022
Fix a bug in the definition level calculation for fields nested within a
struct and a list. When a list is empty or null in parquet the nested
field gets a null value. However, in arrow, the value is simply missing.
When serializing an immediate child of the list, the list offsets are
used to calculate the correct definition level for its children, but it
is not carried further to fields nested deeper (e.g., fields on a struct
within a list).  This (somewhat hacky) fix treats a struct within a list
as if it were a list.
@houqp houqp requested review from nevi-me and alamb January 12, 2022 18:33
@houqp houqp added the bug label Jan 12, 2022
@codecov-commenter
Copy link

codecov-commenter commented Jan 12, 2022

Codecov Report

Merging #1166 (33b6543) into master (884c6a6) will increase coverage by 0.04%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1166      +/-   ##
==========================================
+ Coverage   82.55%   82.59%   +0.04%     
==========================================
  Files         173      173              
  Lines       50673    50753      +80     
==========================================
+ Hits        41833    41921      +88     
+ Misses       8840     8832       -8     
Impacted Files Coverage Δ
parquet/src/arrow/levels.rs 84.28% <100.00%> (+1.40%) ⬆️
arrow/src/array/transform/mod.rs 85.56% <0.00%> (-0.14%) ⬇️
arrow/src/array/builder.rs 86.60% <0.00%> (+0.11%) ⬆️
parquet_derive/src/parquet_field.rs 66.21% <0.00%> (+0.22%) ⬆️
arrow/src/compute/kernels/arithmetic.rs 85.57% <0.00%> (+0.92%) ⬆️
parquet/src/arrow/arrow_reader.rs 90.93% <0.00%> (+1.00%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 884c6a6...33b6543. Read the comment docs.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not an expert in this area -- I would definitely appreciate a review from @nevi-me (or @tustvold as I blieve he has been cursing studying repetition and definition levels)

However, even if this case has some other subtlety we missed, I think the added coverage in this PR means the overall code is better than without so I approve.

Thank you @helgikrs -- very helpful to have someone else looking at this stuff. 🏆

let list_level =
&batch_level.calculate_array_levels(rb.column(0), rb.schema().field(0))[0];

let expected_level = LevelInfo {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be great if someone else who knew this better than I could double check this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not super confident in this either--it would be great if someone with knowledge about the details of this code could chime in.

The definition and repetition levels I compared with what the c++ parquet writer produces. I exported the above record batch and used the C++ parquet writer to generate a parquet file. I then used parquet-dump on the resulting file, which produced the following

value 1: R:0 D:4 V:1
value 2: R:0 D:1 V:<null>
value 3: R:0 D:0 V:<null>
value 4: R:0 D:2 V:<null>
value 5: R:1 D:2 V:<null>
value 6: R:0 D:3 V:<null>
value 7: R:0 D:4 V:2

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can confirm, these numbers check out 👍

.append_value(2)
.unwrap();
values.append(true).unwrap();
list_builder.append(true).unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double checked that the code matches the comments about what structure is intended to be created

@alamb
Copy link
Contributor

alamb commented Jan 13, 2022

cc @mosyp and @chadbrewbaker

@alamb
Copy link
Contributor

alamb commented Jan 14, 2022

Thanks @helgikrs !

@alamb alamb merged commit f2d7a21 into apache:master Jan 14, 2022
@alamb alamb changed the title Bugfix in parquet writing empty lists of structs Fix bug while writing parquet with empty lists of structs Jan 20, 2022
@alamb alamb removed the bug label Jan 20, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Empty or null list of struct cannot be written to parquet
5 participants