
Fix reading of dictionary encoded pages with null values (#1111) #1130

Conversation

yordan-pavlov (Contributor)

Which issue does this PR close?

Closes #1111.

Rationale for this change

As explained in #1111, the RleDecoder used by VariableLenDictionaryDecoder (part of the ArrowArrayReader implementation) incorrectly returns more keys than are actually available. At the same time, when a page contains NULLs, VariableLenDictionaryDecoder also requests more keys than are available, because num_values includes NULLs. Together, these issues cause a dictionary-encoded page containing NULLs to be decoded incorrectly and to return more values than it should. For example, a page with 100 slots of which 20 are NULL encodes only 80 dictionary keys; asking the decoder for 100 keys makes it read too many values from its bit-packed runs.

What changes are included in this PR?

This PR contains:

  • a fix where the actual number of values (excluding NULLs) is calculated from the definition levels (if present) and used, instead of num_values from the data page, when creating the value decoder, so that the decoder knows how many values are actually available; the existing code in VariableLenDictionaryDecoder then uses this count to limit how many keys are requested from the nested RleDecoder (see the sketch after this list)
  • a new test test_arrow_array_reader_dict_enc_string for ArrowArrayReader
  • a new test test_complex_array_reader_dict_enc_string for ArrayReader
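The sketch below is a minimal illustration of the first bullet, not the PR's actual code: it assumes a simple, non-nested column, where a slot is non-NULL exactly when its definition level equals the column's maximum definition level, and the function name is made up for the example.

```rust
// Hedged sketch: derive the number of real (non-NULL) values on a page from
// its decoded definition levels, instead of trusting the page's num_values,
// which counts NULL slots as well.
fn count_non_null_values(def_levels: &[i16], max_def_level: i16) -> usize {
    def_levels
        .iter()
        .filter(|&&level| level == max_def_level)
        .count()
}

fn main() {
    // 5 slots, 2 of them NULL (max_def_level = 1): the page header reports
    // num_values = 5, but only 3 dictionary keys are actually encoded.
    assert_eq!(count_non_null_values(&[1, 0, 1, 1, 0], 1), 3);
}
```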

Are there any user-facing changes?

No

@alamb @tustvold

The github-actions bot added the parquet (Changes to the parquet crate) label on Jan 2, 2022.

codecov-commenter commented Jan 2, 2022

Codecov Report

Merging #1130 (a78fe15) into master (37b843b) will increase coverage by 0.04%.
The diff coverage is 94.55%.


@@            Coverage Diff             @@
##           master    #1130      +/-   ##
==========================================
+ Coverage   82.33%   82.38%   +0.04%     
==========================================
  Files         169      169              
  Lines       49773    50247     +474     
==========================================
+ Hits        40981    41395     +414     
- Misses       8792     8852      +60     
| Impacted Files | Coverage Δ |
|---|---|
| parquet/src/arrow/arrow_array_reader.rs | 79.24% <90.83%> (+1.37%) ⬆️ |
| parquet/src/arrow/array_reader.rs | 78.26% <100.00%> (+1.60%) ⬆️ |
| arrow/src/csv/reader.rs | 88.10% <0.00%> (-2.48%) ⬇️ |
| arrow/src/compute/kernels/comparison.rs | 89.75% <0.00%> (-0.35%) ⬇️ |
| arrow/src/array/array_union.rs | 90.76% <0.00%> (-0.22%) ⬇️ |
| arrow/src/array/builder.rs | 86.49% <0.00%> (-0.05%) ⬇️ |
| arrow/src/datatypes/mod.rs | 100.00% <0.00%> (ø) |
| arrow/src/datatypes/datatype.rs | 66.80% <0.00%> (ø) |
| arrow/src/compute/kernels/cast.rs | 95.07% <0.00%> (+<0.01%) ⬆️ |
| ... and 5 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

alamb (Contributor) commented Jan 3, 2022

Thank you @yordan-pavlov -- I plan to review / test this patch later today

alamb (Contributor) commented Jan 3, 2022

FWIW I tested with datafusion on the case where we initially observed this issue and this patch fixes the issue: apache/datafusion#1441 (comment)

alamb (Contributor) left a comment

Thank you so much @yordan-pavlov -- as I mentioned, I verified that this change fixes the issues we had been seeing in DataFusion.

The basic idea of this PR looks great and makes sense to me, but I am not comfortable enough with the implementation of the parquet reader to fully understand it.

I wonder if @tustvold, @sunchao or @nevi-me could have a look?

Also, I wonder if there are any benchmark results to demonstrate the performance change related to this PR?

num_values: usize,
) -> Result<usize> {
let mut def_level_decoder = LevelValueDecoder::new(level_decoder);
let def_level_array =
Contributor

does this mean an entirely new array of definition levels is created for each column? Might this result in a non-trivial amount of extra runtime overhead?

yordan-pavlov (Contributor, Author)

Yes -- the definition levels are decoded a second time for this fix, and an i16 array and a boolean array are created to count the non-null values; however, they only live for a very short time, and the negative effect on performance is surprisingly small (3% to 8% in my benchmark run), see #1111 (comment). Even after this change, ArrowArrayReader is still often several times faster than the old ArrayReader at decoding strings; that hasn't changed much.

It's probably possible to make this more efficient, but it would require more thinking and more time for a bigger change.
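For illustration only, here is a rough sketch of the approach described above, using arrow's Int16Array and BooleanArray types; the function name and exact shape are assumptions, not code taken from the PR.

```rust
use arrow::array::{Array, BooleanArray, Int16Array};

// Hedged sketch: materialize the decoded definition levels as an Int16Array,
// build a temporary boolean "is non-NULL" mask, and count the `true` entries.
// Both temporary arrays are dropped as soon as the count is returned.
fn count_values_via_temp_arrays(def_levels: Vec<i16>, max_def_level: i16) -> usize {
    let def_level_array = Int16Array::from(def_levels);
    let non_null_mask: BooleanArray = def_level_array
        .iter()
        .map(|level| Some(level == Some(max_def_level)))
        .collect();
    (0..non_null_mask.len())
        .filter(|&i| non_null_mask.value(i))
        .count()
}
```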

Contributor

makes sense -- thank you for running the numbers

tustvold (Contributor) commented Jan 4, 2022

Thank you for this 🎉. I will take a look today if I have time

tustvold (Contributor) left a comment

This change makes sense to me, and seems like a pragmatic quick fix for the bug 👍

@@ -398,19 +416,37 @@ impl<'a, C: ArrayConverter + 'a> ArrowArrayReader<'a, C> {
offset,
def_levels_byte_len,
);
let value_count = Self::count_def_level_values(
tustvold (Contributor), Jan 4, 2022

You might be able to do num_values - num_nulls here as the DataPageV2 has this information. Unfortunately very few implementations in practice seem to produce such pages, and tbh I'm not entirely sure if num_nulls is what I think it is...
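A hedged sketch of that alternative, assuming the parquet crate's Page::DataPageV2 variant and that num_nulls means what its name suggests; this is not code from the PR, and the function is illustrative only.

```rust
use parquet::column::page::Page;

// Hedged sketch: a V2 data page header carries num_nulls, so the non-NULL
// value count can be computed directly; V1 pages have no null count in the
// header, so it still has to be derived from the definition levels.
fn non_null_value_count(page: &Page, def_levels: &[i16], max_def_level: i16) -> usize {
    match page {
        Page::DataPageV2 { num_values, num_nulls, .. } => {
            (num_values - num_nulls) as usize
        }
        _ => def_levels
            .iter()
            .filter(|&&level| level == max_def_level)
            .count(),
    }
}
```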

alamb merged commit 430bdd4 into apache:master on Jan 5, 2022
alamb (Contributor) commented Jan 5, 2022

Thanks again @yordan-pavlov

Labels
parquet (Changes to the parquet crate)

Successfully merging this pull request may close these issues:

ArrowArrayReader Reads Too Many Values From Bit-Packed Runs (#1111)

4 participants