Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix DeltaBitPack MiniBlock Bit Width Padding #1418

Merged
merged 4 commits into from
Mar 14, 2022

Conversation

tustvold
Copy link
Contributor

Which issue does this PR close?

Closes #1416
Closes #1417

Rationale for this change

Fixes apache/datafusion#1976

What changes are included in this PR?

Makes DeltaBitPackEncoder padding deterministic, and makes DeltaBitPackDecoder resilient to non-zero padding

Are there any user-facing changes?

Output that was previously somewhat non-deterministic, is now deterministic.

Ignore non-zero padded bit widths in DeltaBitPackDecoder (apache#1417)
@github-actions github-actions bot added the parquet Changes to the parquet crate label Mar 10, 2022
@tustvold
Copy link
Contributor Author

I have tested Datafusion against this cherry-picked onto the 10.0.0 release and can confirm it fixes apache/datafusion#1976.

The branch is here. I do wonder if it might be an interesting exercise for me to help cut a patch release for this so I can get some exposure to this process, thoughts @alamb?

@@ -668,9 +667,9 @@ where

read += batch_read;
self.mini_block_remaining -= batch_read;
self.values_left -= batch_read;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the fix for #1417 - the problem was that as self.value_left was only decremented outside the loop, the code in next_block was unable to correctly determine when the bit widths were padding

@@ -581,6 +581,13 @@ impl<T: DataType> DeltaBitPackEncoder<T> {
// values left
let n = cmp::min(self.mini_block_size, self.values_in_block);
if n == 0 {
// Decoders should be agnostic to the padding value, we therefore use 0xFF
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the fix for #1416, I opted to pad with a non-zero value for tests so we have coverage of that scenario, but not produce such data outside of tests for better ecosystem-interop

@alamb
Copy link
Contributor

alamb commented Mar 10, 2022

@tustvold is this a regression introduced in parquet? If so, which version was it introduced in (9.1 maybe?)

@alamb
Copy link
Contributor

alamb commented Mar 10, 2022

Was it introduced in #1284 maybe?

@tustvold
Copy link
Contributor Author

tustvold commented Mar 10, 2022

#1417 was a regression introduced by #1284 which was introduced in 9.1, #1416 was a pre-existing issue that has probably existed for a very long time

@alamb
Copy link
Contributor

alamb commented Mar 10, 2022

#1417 (comment)

Maybe we wait to make a backport to see if anyone else hits the issue before we spend time cutting a new release? I have no idea how prevalent it will be

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR needs some tests

@@ -610,11 +617,7 @@ impl<T: DataType> DeltaBitPackEncoder<T> {
self.values_in_block -= n;
}

assert!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this change -- the older assert seemed to be more specific, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aah - I was under the impression assert_eq! logged the textual representation of the arguments, but it does not - will fix

@tustvold
Copy link
Contributor Author

tustvold commented Mar 10, 2022

I think this PR needs some tests

By making it pad with 0xFF when running in test mode we now get coverage from the existing tests?

Edit: I will see if I can come up with a standalone test for #1417

@codecov-commenter
Copy link

Codecov Report

Merging #1418 (e33d853) into master (4bcc7a6) will decrease coverage by 0.00%.
The diff coverage is 87.50%.

❗ Current head e33d853 differs from pull request most recent head 145a1dc. Consider uploading reports for the commit 145a1dc to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1418      +/-   ##
==========================================
- Coverage   83.17%   83.17%   -0.01%     
==========================================
  Files         182      182              
  Lines       53439    53443       +4     
==========================================
+ Hits        44449    44451       +2     
- Misses       8990     8992       +2     
Impacted Files Coverage Δ
parquet/src/encodings/encoding.rs 93.56% <83.33%> (-0.16%) ⬇️
parquet/src/encodings/decoding.rs 89.48% <100.00%> (+0.01%) ⬆️
arrow/src/datatypes/datatype.rs 66.40% <0.00%> (-0.40%) ⬇️
arrow/src/array/transform/mod.rs 86.31% <0.00%> (-0.12%) ⬇️
parquet_derive/src/parquet_field.rs 66.21% <0.00%> (+0.22%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4bcc7a6...145a1dc. Read the comment docs.

@tustvold
Copy link
Contributor Author

I added a test for #1417 and whilst doing so fixed a bug where a truncated page could get the decoder stuck in an infinite loop

@@ -652,6 +652,14 @@ where
.bit_reader
.get_batch(&mut buffer[read..read + batch_to_read], bit_width);

if batch_read != batch_to_read {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prior to this change, the decoder would get stuck in an infinite loop if given a truncated page. I discovered this by accident whilst writing test_delta_bit_packed_padding

let mut decoder = DeltaBitPackDecoder::<Int32Type>::new();
decoder.set_data(ptr.clone(), 0).unwrap();
assert_eq!(decoder.get(&mut output).unwrap(), 419);
assert_eq!(decoder.get_offset(), length);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without the fix for #1417 this would fail

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix looks good to me. Thanks for identifying the bug!

+1 on having 9.1.1 and 10.0.1 branches for the fix.

@@ -444,7 +443,6 @@ pub struct DeltaBitPackDecoder<T: DataType> {
values_per_mini_block: usize,

// Per block info

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unnecessary change?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was cleaner, but don't feel strongly so reverted

@alamb
Copy link
Contributor

alamb commented Mar 12, 2022

I am happy to make 9.1.1 and 10.0.1 branches and prepare the releases.

@sunchao would you recommend we proactively release 9.1.1 and 10.0.1? If so I'll make a ticket and start the process. It is probably the right thing to do I suppose -- I was just dreading the voting process :)

@sunchao
Copy link
Member

sunchao commented Mar 13, 2022

SGTM @alamb , although I'm not sure if we should wait for #1432 too for 10.0.1. BTW I can help on the release process too.

@alamb
Copy link
Contributor

alamb commented Mar 14, 2022

SGTM @alamb , although I'm not sure if we should wait for #1432 too for 10.0.1. BTW I can help on the release process too.

If you could talk on creating the 9.1.1 and 10.0.1 releases that would be great! I'll file a ticket to start tracking it

@alamb
Copy link
Contributor

alamb commented Mar 14, 2022

#1442 tracking the patch releases

@alamb
Copy link
Contributor

alamb commented Mar 14, 2022

Thanks @tustvold and @sunchao

@alamb alamb merged commit 40ef4aa into apache:master Mar 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
4 participants