
Fix reading of dictionary encoded pages with null values (#1111) #1130

Conversation

yordan-pavlov (Contributor)

Which issue does this PR close?

Closes #1111.

Rationale for this change

As explained in #1111, the RleDecoder used by VariableLenDictionaryDecoder (part of the ArrowArrayReader implementation) incorrectly returns more keys than are actually available. At the same time, when a page contains NULLs, VariableLenDictionaryDecoder also requests more keys than are available, because num_values includes NULLs. Together, these issues cause a dictionary-encoded page containing NULLs to be decoded incorrectly and to return more values than it should. For example, a page with 100 slots of which 20 are NULL encodes only 80 dictionary keys; asking the decoder for 100 keys makes it read too many values from its bit-packed runs.

What changes are included in this PR?

This PR contains:

  • a fix where the actual number of values (excluding NULLs) is calculated from the definition levels (if present) and used, instead of num_values from the data page, when creating the value decoder, so that the decoder knows how many values are actually available; the existing code in VariableLenDictionaryDecoder then uses this count to limit how many keys are requested from the nested RleDecoder (see the sketch after this list)
  • a new test test_arrow_array_reader_dict_enc_string for ArrowArrayReader
  • a new test test_complex_array_reader_dict_enc_string for ArrayReader
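The sketch below is a minimal illustration of the first bullet, not the PR's actual code: it assumes a simple, non-nested column, where a slot is non-NULL exactly when its definition level equals the column's maximum definition level, and the function name is made up for the example.

```rust
// Hedged sketch: derive the number of real (non-NULL) values on a page from
// its decoded definition levels, instead of trusting the page's num_values,
// which counts NULL slots as well.
fn count_non_null_values(def_levels: &[i16], max_def_level: i16) -> usize {
    def_levels
        .iter()
        .filter(|&&level| level == max_def_level)
        .count()
}

fn main() {
    // 5 slots, 2 of them NULL (max_def_level = 1): the page header reports
    // num_values = 5, but only 3 dictionary keys are actually encoded.
    assert_eq!(count_non_null_values(&[1, 0, 1, 1, 0], 1), 3);
}
```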

Are there any user-facing changes?

No

@alamb @tustvold

The github-actions bot added the parquet (Changes to the parquet crate) label on Jan 2, 2022.

codecov-commenter commented Jan 2, 2022

Codecov Report

Merging #1130 (a78fe15) into master (37b843b) will increase coverage by 0.04%.
The diff coverage is 94.55%.


@@            Coverage Diff             @@
##           master    #1130      +/-   ##
==========================================
+ Coverage   82.33%   82.38%   +0.04%     
==========================================
  Files         169      169              
  Lines       49773    50247     +474     
==========================================
+ Hits        40981    41395     +414     
- Misses       8792     8852      +60     
| Impacted Files | Coverage Δ |
|---|---|
| parquet/src/arrow/arrow_array_reader.rs | 79.24% <90.83%> (+1.37%) ⬆️ |
| parquet/src/arrow/array_reader.rs | 78.26% <100.00%> (+1.60%) ⬆️ |
| arrow/src/csv/reader.rs | 88.10% <0.00%> (-2.48%) ⬇️ |
| arrow/src/compute/kernels/comparison.rs | 89.75% <0.00%> (-0.35%) ⬇️ |
| arrow/src/array/array_union.rs | 90.76% <0.00%> (-0.22%) ⬇️ |
| arrow/src/array/builder.rs | 86.49% <0.00%> (-0.05%) ⬇️ |
| arrow/src/datatypes/mod.rs | 100.00% <0.00%> (ø) |
| arrow/src/datatypes/datatype.rs | 66.80% <0.00%> (ø) |
| arrow/src/compute/kernels/cast.rs | 95.07% <0.00%> (+<0.01%) ⬆️ |
| ... and 5 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

alamb (Contributor) commented Jan 3, 2022

Thank you @yordan-pavlov -- I plan to review / test this patch later today

alamb (Contributor) commented Jan 3, 2022

FWIW I tested with datafusion on the case where we initially observed this issue and this patch fixes the issue: apache/datafusion#1441 (comment)

alamb (Contributor) left a comment

Thank you so much @yordan-pavlov -- as I mentioned, I verified that this change fixes the issues we had been seeing in DataFusion.

The basic idea of this PR looks great and makes sense to me, but I am not comfortable enough with the implementation of the parquet reader to fully understand it.

I wonder if @tustvold, @sunchao or @nevi-me could have a look?

Also, I wonder if there are any benchmark results to demonstrate the performance change related to this PR?

num_values: usize,
) -> Result<usize> {
let mut def_level_decoder = LevelValueDecoder::new(level_decoder);
let def_level_array =
Contributor

does this mean an entirely new array of definition levels is created for each column? Might this result in a non-trivial amount of extra runtime overhead?

yordan-pavlov (Contributor, Author)

Yes -- the definition levels are decoded a second time for this fix, and an i16 array and a boolean array are created to count the non-null values; however, they only live for a very short time, and the negative effect on performance is surprisingly small (3% to 8% in my benchmark run), see #1111 (comment). Even after this change, ArrowArrayReader is still often several times faster than the old ArrayReader at decoding strings; that hasn't changed much.

It's probably possible to make this more efficient, but it would require more thinking and more time for a bigger change.
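For illustration only, here is a rough sketch of the approach described above, using arrow's Int16Array and BooleanArray types; the function name and exact shape are assumptions, not code taken from the PR.

```rust
use arrow::array::{Array, BooleanArray, Int16Array};

// Hedged sketch: materialize the decoded definition levels as an Int16Array,
// build a temporary boolean "is non-NULL" mask, and count the `true` entries.
// Both temporary arrays are dropped as soon as the count is returned.
fn count_values_via_temp_arrays(def_levels: Vec<i16>, max_def_level: i16) -> usize {
    let def_level_array = Int16Array::from(def_levels);
    let non_null_mask: BooleanArray = def_level_array
        .iter()
        .map(|level| Some(level == Some(max_def_level)))
        .collect();
    (0..non_null_mask.len())
        .filter(|&i| non_null_mask.value(i))
        .count()
}
```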

Contributor

makes sense -- thank you for running the numbers

tustvold (Contributor) commented Jan 4, 2022

Thank you for this 🎉. I will take a look today if I have time

tustvold (Contributor) left a comment

This change makes sense to me, and seems like a pragmatic quick fix for the bug 👍

@@ -398,19 +416,37 @@ impl<'a, C: ArrayConverter + 'a> ArrowArrayReader<'a, C> {
offset,
def_levels_byte_len,
);
let value_count = Self::count_def_level_values(
tustvold (Contributor), Jan 4, 2022

You might be able to do num_values - num_nulls here as the DataPageV2 has this information. Unfortunately very few implementations in practice seem to produce such pages, and tbh I'm not entirely sure if num_nulls is what I think it is...
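A hedged sketch of that alternative, assuming the parquet crate's Page::DataPageV2 variant and that num_nulls means what its name suggests; this is not code from the PR, and the function is illustrative only.

```rust
use parquet::column::page::Page;

// Hedged sketch: a V2 data page header carries num_nulls, so the non-NULL
// value count can be computed directly; V1 pages have no null count in the
// header, so it still has to be derived from the definition levels.
fn non_null_value_count(page: &Page, def_levels: &[i16], max_def_level: i16) -> usize {
    match page {
        Page::DataPageV2 { num_values, num_nulls, .. } => {
            (num_values - num_nulls) as usize
        }
        _ => def_levels
            .iter()
            .filter(|&&level| level == max_def_level)
            .count(),
    }
}
```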

alamb merged commit 430bdd4 into apache:master on Jan 5, 2022
alamb (Contributor) commented Jan 5, 2022

Thanks again @yordan-pavlov

Labels
parquet (Changes to the parquet crate)

Successfully merging this pull request may close these issues:

ArrowArrayReader Reads Too Many Values From Bit-Packed Runs (#1111)

4 participants