
Improve parquet reading performance for columns with nulls by preserving bitmask when possible (#1037) #1054

Merged: 6 commits into apache:master, Jan 13, 2022

Conversation


@tustvold commented Dec 17, 2021

Which issue does this PR close?

Highly experimental, builds on #1021 #1039 #1052 #1041

Closes #1037

Rationale for this change

See ticket.

This leads to anything from a 2-6x performance improvement when decoding columns containing nulls. As is to be expected, the biggest savings come where the other decode overheads are smallest, with the 6x return on the "Int32Array, plain encoded, optional, half NULLs - old" benchmark.

There is some funkiness between the benchmarks and the memory allocator on my local machine, where it is "faster" to preallocate a single 64-byte array before trying to read data.

What changes are included in this PR?

This changes RecordReader to use a new DefinitionLevelBuffer with a corresponding DefinitionLevelDecoder that can read directly from parquet, skipping intermediate buffering and avoiding decoding parquet bitmasks where not necessary.

Are there any user-facing changes?

No

@github-actions bot added the arrow (Changes to the arrow crate) and parquet (Changes to the parquet crate) labels Dec 17, 2021
```rust
>;

#[doc(hidden)]
pub struct GenericColumnReader<R, D, V> {
```
Contributor:

It would help me to document somewhere what R, D and V are intended for (to make reading this code easier)
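
For example, documentation along these lines might help (my reading of what the parameters stand for, inferred from context rather than stated in the PR):

```rust
use std::marker::PhantomData;

/// Reads levels and values from a column chunk using pluggable decoders:
///
/// * `R`: decoder for repetition levels
/// * `D`: decoder for definition levels (e.g. `DefinitionLevelDecoder`)
/// * `V`: decoder for the column values
pub struct GenericColumnReader<R, D, V> {
    // actual fields elided in this sketch
    _markers: PhantomData<(R, D, V)>,
}
```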

tustvold commented Dec 21, 2021

Pushed a commit to this PR that fixes a bug in the handling of ColumnLevelDecoder::read w.r.t. ranges. This is an area of the traits that I'm currently not very happy with; it largely stems from preserving the ability to pass [i16] slices to ColumnLevelDecoderImpl, which have no state about where they've been written up to, and so need this to be passed in. I hope to remove this derp prior to the merge of #1041, and by extension this PR.

```rust
} else if self.packed_count != self.packed_offset {
    // Copy the remaining bit-packed values straight into the output bitmask,
    // starting at their absolute bit offset within the source data
    let to_read = (self.packed_count - self.packed_offset).min(len - read);
    let offset = self.data_offset * 8 + self.packed_offset;
    buffer.append_packed_range(offset..offset + to_read, self.data.as_ref());
```
Contributor:

looks like this is the main change in this PR? how often does this case happen for def levels in practice?

@tustvold (author) replied Dec 29, 2021:

> how often does this case happen for def levels in practice

This depends on what you mean by "this" 😅

The major change in this PR is not decoding definition levels for columns without nested nullability - i.e. max_def_level == 1, and just decoding directly to the null bitmask. This is very common, with almost all parquet data I've come across being flat.

My personal experience with projects trying to use nested data in parquet is that it eventually becomes too much of a pain due to patchy ecosystem support, and the schema ends up just getting flattened.

Previously the code would allocate i16 buffers, populate them with the decoded data, and then deduce a null bitmask from these i16 buffers. This code will now decode directly to the null bitmask in the event of max_def_level == 1, avoiding allocations along with the costs associated with decode and bitmask reconstruction.

As an added bonus, it happens that by decoding directly we can exploit the inherent properties of the hybrid encoding to improve performance - with the packed representation already being a bitmask, and the RLE representation allowing operations on runs of bits.
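
To make the contrast concrete, a minimal sketch of the two paths (hypothetical helper names; not the PR's actual code):

```rust
// Old path: materialize every definition level as an i16, then derive
// the null bitmask from that buffer in a second pass.
fn null_mask_from_levels(def_levels: &[i16], max_def_level: i16) -> Vec<bool> {
    def_levels.iter().map(|&level| level == max_def_level).collect()
}

// New path, valid when max_def_level == 1: each level is 0 or 1, so the
// decoded bits already *are* the validity bitmask and can be written out
// directly, with no i16 buffer and no second pass.
fn null_mask_direct(decoded_bits: impl Iterator<Item = bool>) -> Vec<bool> {
    decoded_bits.collect()
}
```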

Contributor:

Apologies, I should have been more explicit; what I meant is: how common is it in practice to have max_def_level == 1 plus bit-packing of the def levels? That is where the biggest optimization is, isn't it? RLE-encoded def level reading would still be better than before (as there is no intermediate translation into integers), and that's great, but probably not as fast as directly copying the bit-packed values. I do agree on flat parquet files being common, though; most parquet files I have seen have been flat as well.

Contributor:

on second thought, reading of run-length-encoded def levels could be just as fast if append_packed could be used for it as well (except that the buffer to copy from would be a static buffer of all 1s of some fixed length)

@tustvold (author) replied Dec 29, 2021:

> plus bit-packing of the def levels

The logic within RleEncoder uses run-length encoding if the repetition count is greater than 8, otherwise it uses the bit-packed version. Therefore how common bit-packed sequences are depends on the distribution of nulls within the data.

To be clear, what parquet calls RLE encoding is actually a hybrid encoding: a page isn't entirely bit-packed or run-length encoded, but contains blocks of either.

> but probably not as fast as directly copying the bit-packed values

I'm not sure I agree with this: copying the bit-packed values is actually potentially more expensive, as it requires shifting and masking the source data. By contrast, inserting a run of nulls is simply a case of incrementing the length of the buffer (as everything is 0-initialized), and setting runs of valid bits can be done at the byte level (or possibly larger).
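
For readers unfamiliar with the format, here is a simplified sketch of the hybrid framing being discussed, based on my reading of the parquet spec rather than on this PR's decoder; the two branches correspond to the append_packed_range and append_n snippets quoted in this thread:

```rust
/// Decode a parquet RLE/bit-packed hybrid run with bit width 1 (such as
/// the definition levels of a flat nullable column). Simplified sketch.
fn decode_hybrid_bit_width_1(mut data: &[u8], num_values: usize) -> Vec<bool> {
    let mut out = Vec::with_capacity(num_values);
    while out.len() < num_values {
        let (header, rest) = read_uleb128(data);
        data = rest;
        if header & 1 == 1 {
            // Bit-packed block of (header >> 1) * 8 values, one bit each.
            // These bytes are already a bitmask and can be copied wholesale.
            let num_bits = ((header >> 1) * 8) as usize;
            for i in 0..num_bits.min(num_values - out.len()) {
                out.push((data[i / 8] >> (i % 8)) & 1 == 1);
            }
            data = &data[(num_bits + 7) / 8..];
        } else {
            // RLE run of (header >> 1) copies of one value, stored in a
            // single byte for bit width 1. A run of 0s or 1s can be
            // appended to a bitmask in bulk rather than bit by bit.
            let run_len = ((header >> 1) as usize).min(num_values - out.len());
            let value = data[0] & 1 == 1;
            data = &data[1..];
            out.extend(std::iter::repeat(value).take(run_len));
        }
    }
    out
}

/// Minimal unsigned LEB128 reader used by the sketch above.
fn read_uleb128(mut data: &[u8]) -> (u64, &[u8]) {
    let (mut value, mut shift) = (0u64, 0u32);
    loop {
        let byte = data[0];
        data = &data[1..];
        value |= u64::from(byte & 0x7F) << shift;
        if byte & 0x80 == 0 {
            return (value, data);
        }
        shift += 7;
    }
}
```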

```rust
while read != len {
    if self.rle_left != 0 {
        // A run-length encoded run: append `to_read` copies of the run value
        let to_read = self.rle_left.min(len - read);
        buffer.append_n(to_read, self.rle_value);
```
Contributor:

I wonder if append_n could be made faster using append_packed
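
For illustration, one way an append_n over a zero-initialized, byte-backed bitmask can already work at byte granularity (a hypothetical helper sketching the idea, not arrow's BooleanBufferBuilder implementation):

```rust
/// Append `n` copies of `bit` to a little-endian bitmask holding `len` bits.
fn append_n(buf: &mut Vec<u8>, len: &mut usize, n: usize, bit: bool) {
    let new_len = *len + n;
    buf.resize((new_len + 7) / 8, 0); // growth is zero-initialized
    if bit {
        let mut i = *len;
        // Partial leading byte, bit by bit
        while i < new_len && i % 8 != 0 {
            buf[i / 8] |= 1 << (i % 8);
            i += 1;
        }
        // Whole bytes in a single store each (a memset in practice)
        while i + 8 <= new_len {
            buf[i / 8] = 0xFF;
            i += 8;
        }
        // Partial trailing byte
        while i < new_len {
            buf[i / 8] |= 1 << (i % 8);
            i += 1;
        }
    }
    // Appending a run of 0s writes nothing at all: the storage is already
    // zeroed, so it is just an increment of the logical length.
    *len = new_len;
}
```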

codecov-commenter commented Jan 11, 2022

Codecov Report

Merging #1054 (55c2f6f) into master (06431ee) will increase coverage by 0.04%.
The diff coverage is 86.57%.

Impacted file tree graph

```
@@            Coverage Diff             @@
##           master    #1054      +/-   ##
==========================================
+ Coverage   82.53%   82.58%   +0.04%     
==========================================
  Files         173      173              
  Lines       50615    50876     +261     
==========================================
+ Hits        41774    42014     +240     
- Misses       8841     8862      +21     
```
| Impacted Files | Coverage Δ |
| --- | --- |
| parquet/src/arrow/array_reader.rs | 77.16% <81.57%> (-0.14%) ⬇️ |
| ...rquet/src/arrow/record_reader/definition_levels.rs | 86.20% <86.74%> (-4.12%) ⬇️ |
| arrow/src/array/builder.rs | 86.50% <100.00%> (+0.01%) ⬆️ |
| parquet/src/arrow/record_reader.rs | 94.75% <100.00%> (+0.74%) ⬆️ |
| parquet/src/arrow/record_reader/buffer.rs | 86.00% <100.00%> (+0.89%) ⬆️ |
| parquet/src/column/reader/decoder.rs | 76.27% <100.00%> (ø) |
| arrow/src/array/mod.rs | 100.00% <0.00%> (ø) |
| arrow/src/array/cast.rs | 91.66% <0.00%> (ø) |
| arrow/src/buffer/ops.rs | 96.77% <0.00%> (ø) |
| ... and 4 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@tustvold tustvold marked this pull request as ready for review January 11, 2022 18:22
@tustvold:
Looking into test failures

@alamb alamb changed the title Preserve Parquet Bitmask (#1037) Improve parquet reading performance for columns with nulls by preserving bitmask when possible (#1037) Jan 12, 2022
@alamb left a comment:

I think this is great work. Thank you @tustvold

@yordan-pavlov would you like to review again prior to merge?

```rust
    /// [`Self::consume_def_levels`] and [`Self::consume_rep_levels`] will always return `None`
    ///
    pub(crate) fn new_with_options(desc: ColumnDescPtr, null_mask_only: bool) -> Self {
        let def_levels = (desc.max_def_level() > 0)
```
Contributor:

I don't understand the use of null_mask_only here -- I thought null_mask_only would be set only if max_def_level() == 1

AKA https://github.com/apache/arrow-rs/pull/1054/files#diff-0d6bed48d78c5a2472b7680a8185cabdc0bd259d6484e184439ed7830060661fR1374

@tustvold (author) replied:

Added a comment clarifying; it's an edge case of nested nullability. Perhaps I should add an explicit test 🤔

@tustvold (author) replied:

Test added in 59846eb

```rust
type Slice = [i16];

impl DefinitionLevelBuffer {
    pub fn new(desc: &ColumnDescPtr, null_mask_only: bool) -> Self {
        let inner = match null_mask_only {
```
Contributor:

I wonder why null_mask_only is passed all the way down here only to be rechecked / assert!ed.

Would it be possible / feasible to decide here in DefinitionLevelBuffer::new to use BufferInner::Mask if max_def_level() is 1 and max_rep_level() is 0, and thus avoid plumbing the argument around? (See the sketch below.)
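
A sketch of that alternative, using hypothetical stand-in types (the PR itself keeps the explicit flag):

```rust
// Hypothetical stand-ins, not the crate's definitions.
struct ColumnDesc {
    max_def_level: i16,
    max_rep_level: i16,
}

enum BufferInner {
    /// Definition levels materialized as i16s
    Full { levels: Vec<i16> },
    /// Only a null bitmask; valid when levels can only be 0 or 1
    Mask { nulls: Vec<u8>, len: usize },
}

impl BufferInner {
    /// Decide the representation from the descriptor alone, instead of
    /// threading a `null_mask_only` flag through every caller.
    fn new(desc: &ColumnDesc) -> Self {
        if desc.max_def_level == 1 && desc.max_rep_level == 0 {
            BufferInner::Mask { nulls: Vec::new(), len: 0 }
        } else {
            BufferInner::Full { levels: Vec::new() }
        }
    }
}
```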

```rust
let decoder = match self.data.take() {
    Some(data) => self
        .packed_decoder
        .insert(PackedDecoder::new(self.encoding, data)),
```
Contributor:

TIL: Option::insert 👍

```rust
    }
}

struct PackedDecoder {
```
Contributor:

This code looks quite similar to BitReader https://github.com/tustvold/arrow-rs/blob/bitmask-preservation/parquet/src/util/bit_util.rs#L501

I wonder if you looked at possibly reusing that implementation?

@tustvold (author) replied:

The short answer is that not using that implementation is the major reason this PR represents a non-trivial speed bump: this decoder can work more optimally because it decodes directly into the bitmask using append_packed_range / append_n. Will add some comments clarifying.


```rust
use rand::{thread_rng, Rng, RngCore};

#[test]
```
Contributor:

I know there is now significant coverage of this code using the fuzz tests -- #1156 and friends.

Do you think that is sufficient coverage for PackedDecoder? Or would some more targeted unit tests be valuable too?

@tustvold (author) replied:

Should be possible to write a simple test that compares the output with that of BitReader 👍 Will do

@tustvold (author) replied:

Test added in b001f11

```rust
    packed_offset: usize,
}

impl PackedDecoder {
```
Contributor:

I don't understand the details of the parquet format sufficiently to truly evaluate the correctness of this code; perhaps some additional test coverage would help, but the fuzz testing may be good enough.

```rust
    }
}

struct PackedDecoder {
```
Contributor:

Suggested change:

```diff
-struct PackedDecoder {
+/// Specialized decoder for bitpacked hybrid format (TODO link) that contains
+/// only 0 and 1 (for example, definition levels in a non-nested column)
+/// that directly decodes into a bitmask in the fastest possible way
+struct PackedDecoder {
```

Contributor:

I am trying to leave breadcrumbs for the next person to look at this code. Is this a correct description of what this structure implements?

alamb commented Jan 12, 2022

Likewise cc @nevi-me @sunchao in case you are interested

alamb commented Jan 12, 2022

Unless anyone wants additional time to review, I'll plan to merge this tomorrow

```
@@ -228,6 +232,20 @@ impl ColumnLevelDecoder for DefinitionLevelDecoder {
    }
}

/// An optimized decoder for decoding [RLE] and [BIT_PACKED] data with a bit width of 1
```
Contributor:

👍

@alamb alamb merged commit 231cf78 into apache:master Jan 13, 2022
alamb commented Jan 13, 2022

Thanks @tustvold -- this is pretty epic

```rust
assert_eq!(range.start + writer.len, nulls.len());

let decoder = match self.data.take() {
    Some(data) => self.column_decoder.insert(
```
Contributor:

It looks like the intention is that self.data will only be used once (to create a ColumnLevelDecoderImpl), and if that's the case, why not move the entire match statement into the constructor?

@tustvold (author) replied:

Because the type of writer determines the type of decoder: if it is BufferInner::Full this constructs ColumnLevelDecoderImpl, otherwise it constructs PackedDecoder. I guess we could just construct both, but this way you'd get a panic if you change the writer type...
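
A sketch of the dispatch being described, with hypothetical simplified types standing in for the PR's:

```rust
// Hypothetical simplified shapes, not the PR's actual types.
enum LevelBuffer {
    Full(Vec<i16>), // materialized levels
    Mask(Vec<u8>),  // bitmask only
}

enum LevelDecoder {
    ColumnImpl, // stands in for ColumnLevelDecoderImpl
    Packed,     // stands in for PackedDecoder
}

/// Choose the decoder variant from the buffer representation; constructing
/// only the matching decoder means a mismatched writer type fails loudly
/// instead of silently mis-decoding.
fn decoder_for(buffer: &LevelBuffer) -> LevelDecoder {
    match buffer {
        LevelBuffer::Full(_) => LevelDecoder::ColumnImpl,
        LevelBuffer::Mask(_) => LevelDecoder::Packed,
    }
}
```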

Labels: arrow (Changes to the arrow crate), parquet (Changes to the parquet crate), performance
Projects: none yet
Linked issues this PR may close: Parquet Preserve BitMask (#1037)
4 participants