DeltaBitPackDecoder Incorrectly Handles Non-Zero MiniBlock Bit Width Padding #1417

Closed
tustvold opened this issue Mar 10, 2022 · 1 comment · Fixed by #1418
Labels
bug parquet Changes to the parquet crate

Comments

@tustvold
Contributor

Describe the bug

From the Parquet specification:

If, in the last block, less than <number of miniblocks in a block> miniblocks are needed to store the values, the bytes storing the bit widths of the unneeded miniblocks are still present, their value should be zero, but readers must accept arbitrary values as well.

This is particularly important at the moment because of #1416, which causes the padding to be arbitrary when an encoder is reused.

To Reproduce

This is one of the underlying bugs behind apache/datafusion#1976

Expected behavior

The decoder should ignore the padded miniblock bit widths.
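
As a rough illustration of the expected behavior, here is a minimal sketch of how a reader could discard padded bit widths. It is not the arrow-rs implementation: `mask_padded_bit_widths` and its parameters are hypothetical names, and it assumes the decoder already knows how many values remain in the page and how many values each miniblock holds.

```rust
/// Hypothetical helper, not part of the parquet crate's API.
///
/// `bit_widths` holds the per-miniblock width bytes read from a block header.
/// Any miniblock that starts after the last remaining value is padding, so its
/// width byte may contain arbitrary data and is forced to zero here.
fn mask_padded_bit_widths(
    bit_widths: &mut [u8],
    values_remaining: usize,
    values_per_mini_block: usize,
) {
    let mut remaining = values_remaining;
    for width in bit_widths.iter_mut() {
        if remaining == 0 {
            // No values are stored in this miniblock: ignore whatever the
            // writer left in the width byte.
            *width = 0;
        }
        // Each miniblock consumes up to `values_per_mini_block` values.
        remaining = remaining.saturating_sub(values_per_mini_block);
    }
}
```

For example, with four miniblocks of 32 values each and only 40 values left, the first two widths are kept and the last two are treated as zero regardless of what the writer stored there.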

@tustvold tustvold added the bug label Mar 10, 2022
tustvold added a commit to tustvold/arrow-rs that referenced this issue Mar 10, 2022
Ignore non-zero padded bit widths in DeltaBitPackDecoder (apache#1417)
@alamb
Contributor

alamb commented Mar 10, 2022

Per @tustvold, this is a regression introduced in #1284, which shipped in 9.1 (https://github.com/apache/arrow-rs/blob/master/CHANGELOG.md#910-2022-02-19), and thus affects 9.1.0 and 10.0.0.

We may want to consider backporting the fix to releases 9.1.1 and 10.0.1.

alamb pushed a commit that referenced this issue Mar 14, 2022
* Consistent DeltaBitPackEncoder bit width padding (#1416)

Ignore non-zero padded bit widths in DeltaBitPackDecoder (#1417)

* chore: review feedback

* Add test of DeltaBitPackDecoder padding

* Revert formatting
@alamb alamb added the parquet Changes to the parquet crate label Mar 17, 2022