
Preserve dictionary encoding when decoding parquet into Arrow arrays, 60x perf improvement (#171) #1180

Merged
10 commits merged on Jan 24, 2022

Conversation

tustvold (Contributor) commented Jan 14, 2022

Which issue does this PR close?

Closes #171.

Rationale for this change

See ticket

What changes are included in this PR?

Adds dictionary preservation when decoding dictionary-encoded parquet data into dictionary-encoded Arrow arrays.

Are there any user-facing changes?

No
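For context, a minimal sketch (not part of this PR) of the read path this change affects, assuming a recent version of the parquet crate with the ParquetRecordBatchReaderBuilder API and a hypothetical data.parquet file whose first column is a dictionary-encoded string column. With dictionary preservation, that column comes back as a DictionaryArray instead of being materialized into a StringArray:

```rust
use std::fs::File;

use arrow::array::{Array, DictionaryArray};
use arrow::datatypes::Int32Type;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // "data.parquet" is a placeholder for a file whose column 0 is a
    // dictionary-encoded string column
    let file = File::open("data.parquet")?;
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;

    for batch in reader {
        let batch = batch?;
        // With dictionary preservation the column can be downcast directly;
        // previously it would have been materialized as a StringArray
        let col = batch
            .column(0)
            .as_any()
            .downcast_ref::<DictionaryArray<Int32Type>>()
            .expect("expected a dictionary-encoded column");
        println!("{} rows, {} dictionary values", col.len(), col.values().len());
    }
    Ok(())
}
```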

github-actions bot added the parquet (Changes to the parquet crate) label on Jan 14, 2022
tustvold force-pushed the byte-array-dictionary branch 5 times, most recently from 58da0f8 to 8b0adeb, on January 18, 2022 15:08
@@ -0,0 +1,95 @@
// Licensed to the Apache Software Foundation (ASF) under one
tustvold (author):

Moved from definition_levels.rs

@@ -351,39 +346,6 @@ impl PackedDecoder {
}
}

/// Counts the number of set bits in the provided range
tustvold (author):

Moved into crate::arrow::bit_util
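For illustration, a minimal sketch (mine, not the crate's implementation) of what such a helper computes over an LSB-packed bitmask:

```rust
use std::ops::Range;

/// Counts the set bits within `range` of an LSB-ordered packed bitmask.
/// A bit-at-a-time sketch; a real implementation would process whole
/// bytes or words at a time.
fn count_set_bits(data: &[u8], range: Range<usize>) -> usize {
    range.filter(|&i| data[i / 8] & (1 << (i % 8)) != 0).count()
}

fn main() {
    // Bits (LSB first): 1, 0, 1, 1 => byte 0b0000_1101
    assert_eq!(count_set_bits(&[0b0000_1101], 0..4), 3);
    assert_eq!(count_set_bits(&[0b0000_1101], 1..3), 1);
}
```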

///
fn pad_nulls(
&mut self,
read_offset: usize,
values_read: usize,
levels_read: usize,
rev_valid_position_iter: impl Iterator<Item = usize>,
valid_mask: &[u8],
tustvold (author):

This change makes tests easier to write; I think it is clearer and may in fact optimize better.

@@ -668,15 +668,9 @@ mod tests {
assert_eq!(decoder.read(&mut output, 4..8).unwrap(), 0);

let valid = vec![false, false, true, true, false, true, true, false, false];
let rev_position_iter = valid
tustvold (author):

This is an example of how changing pad_nulls to take a packed bitmask makes the tests less verbose
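As a side note, a hypothetical helper (not from the PR) showing how such a packed mask relates to a Vec of bools; Arrow validity bitmaps are LSB-ordered, so bit i of byte i / 8 corresponds to slot i:

```rust
/// Packs booleans into an LSB-ordered bitmask, matching the layout of
/// Arrow validity bitmaps
fn pack_bools(bools: &[bool]) -> Vec<u8> {
    let mut packed = vec![0u8; (bools.len() + 7) / 8];
    for (i, &b) in bools.iter().enumerate() {
        if b {
            packed[i / 8] |= 1 << (i % 8);
        }
    }
    packed
}

fn main() {
    // The `valid` vector from the test above
    let valid = [false, false, true, true, false, true, true, false, false];
    // Bits 2, 3, 5 and 6 set in the first byte, none in the second
    assert_eq!(pack_bools(&valid), vec![0b0110_1100, 0b0000_0000]);
}
```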


#[test]
fn test_byte_array_decoder() {
let data: Vec<_> = vec!["hello", "world", "a", "b"]
tustvold (author):

This code is extracted out into test_utils so that it can be reused

use arrow::array::{Array, StringArray};
use std::sync::Arc;

fn column() -> ColumnDescPtr {
tustvold (author):

This code is extracted out into test_utils

tustvold marked this pull request as ready for review January 18, 2022 18:32
tustvold (author):

Unfortunately I needed to move some code around in order to write tests for this functionality, so this has ended up being another rather large diff 😞 but I've tried to comment where this has taken place so hopefully that helps.

ArrowType::Dictionary(key_type, value_type) => {
make_reader! {
(pages, column_desc, data_type, null_mask_only) => match (key_type.as_ref(), value_type.as_ref()) {
(ArrowType::UInt8, ArrowType::Binary | ArrowType::Utf8) => (u8, i32),
tustvold (author):

Many of these aren't properly supported by the arrow compute kernels, but I figured better to be thorough
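For reference, one of these key/value combinations expressed as an Arrow DataType (a snippet of mine, not from the PR):

```rust
use arrow::datatypes::DataType;

fn main() {
    // Dictionary with u8 keys and UTF-8 values, one of the pairs matched above
    let dict = DataType::Dictionary(Box::new(DataType::UInt8), Box::new(DataType::Utf8));
    println!("{:?}", dict);
}
```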

alamb changed the title from "Preserve dictionary encoding from parquet (#171)" to "Preserve dictionary encoding when decoding parquet into Arrow arrays (#171)" on Jan 19, 2022
alamb (Contributor) left a comment:

I went through this pretty carefully and I think I understand the code and it makes sense to me 👍 @tustvold

I also ran the datafusion suite (cargo test --all) against this branch and all the tests passed

Do you have any performance results you can share here (e.g. a benchmark of reading a dictionary encoded string column from a parquet file to an Arrow array)?

Also, I think some sort of "end to end" test is probably appropriate -- i.e. one that reads a parquet file and ensures that the output columns are really dictionary encoded.

FYI @yordan-pavlov and @nevi-me and @GregBowyer

I plan to write a mini program to validate that some of our internally generated parquet files did indeed read back a dictionary array rather than a StringArray

ArrowDataType::UInt64,
];

for key in &key_types {
Reviewer (Contributor):

this is becoming a somewhat epic single test -- I wonder if it would be valuable to split it into smaller tests that can run in parallel (as a future PR)

tustvold (author):

Yeah... Tbh there may be value in splitting it out as an actual fuzz testing suite in its own right...

let strings = strings.as_any().downcast_ref::<StringArray>().unwrap();
assert_eq!(
strings.iter().collect::<Vec<_>>(),
vec![None, None, Some("world"), Some("hello")]
Reviewer (Contributor):

Since this was splitting off the first 4 elements (indices 1, 0, 3, 2 and valid f,f,t,t) I would have expected the array produced to be None, None, Some("a"), Some("") so I clearly misunderstand something about how this code works

tustvold (author):

It reads 4 non-null values

[Some("world"), Some("hello"), Some("a"), Some(""), Some("b")]

It then pads them out based on the null mask

[None, None, Some("world"), Some("hello"), None, Some("a"), Some(""), Some("b")]

It then splits off the first four

[None, None, Some("world"), Some("hello")]

Leaving behind in the buffer

[None, Some("a"), Some(""), Some("b")]

Reviewer (Contributor):

thank you for the clarification

}

#[test]
fn test_validates_keys() {
Reviewer (Contributor):

Note to other reviewers that there is a test for decoding a dictionary whose keys are too large (e.g. an index of 500 when the key type is Int8) in byte_array_dictionary.rs
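To see why that matters, a tiny standalone illustration (mine, not the PR's test) of the overflow being guarded against:

```rust
fn main() {
    // An Int8 key can address at most 128 dictionary entries, so an
    // index of 500 must produce an error rather than wrap around
    assert!(i8::try_from(500_usize).is_err());
    // A naive `as` cast would silently truncate to a nonsense index
    assert_eq!(500_usize as i8, -12);
}
```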

alamb (Contributor) left a comment:

With an end-to-end test I think this PR is ready to merge, but I would like to see if other reviewers would like time to review as well

alamb (Contributor) commented Jan 19, 2022

I tried reading one of our internal parquet files with this branch and I got an internal error -- I will post a reproducer shortly

tustvold (author):

Yup - looks like this barfs if it encounters an empty dictionary. Will add tests and fix tomorrow 👍

alamb (Contributor) commented Jan 19, 2022

> Yup - looks like this barfs if it encounters an empty dictionary. Will add tests and fix tomorrow 👍

For anyone following along, I shared the internal file with @tustvold . I also have some ideas to increase the coverage of the parquet reader (including the "end to end" test I mentioned reading out a file). I'll try and write it up (and maybe even code it) tomorrow

Commit: Don't panic if missing dictionary page
tustvold (author) commented Jan 20, 2022

I've pushed a commit that fixes the handling of empty dictionaries. I vacillated between returning an error if trying to read data from a column with an empty dictionary, or just returning 0 values read. In the end I went with the latter as it isn't always known how many non-null values a column contains.

If GenericColumnReader thinks there should be non-null values, as determined from the definition levels, it will return an error along the lines of "insufficient values read from column".

Whilst doing this I also changed to return an error instead of panicking if there is no dictionary page present.

Will work on an end-to-end test along with some benchmarks next
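A rough sketch (my paraphrase of the contract described above, not the PR's actual code):

```rust
/// Simplified stand-in for a dictionary-keyed column decoder
struct DictDecoder {
    dictionary: Vec<String>,
    pending_keys: Vec<usize>,
}

impl DictDecoder {
    /// Reads up to `len` keys into `out`. An empty dictionary reports
    /// 0 values read rather than an error, since the column may contain
    /// only nulls; the caller (GenericColumnReader in the real code)
    /// reconciles the count against the definition levels and raises
    /// "insufficient values read from column" if values were expected.
    fn read(&mut self, out: &mut Vec<usize>, len: usize) -> Result<usize, String> {
        if self.dictionary.is_empty() {
            return Ok(0);
        }
        let n = len.min(self.pending_keys.len());
        out.extend(self.pending_keys.drain(..n));
        Ok(n)
    }
}
```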

Commit: Tweak RecordReader buffering logic
codecov-commenter:

Codecov Report

Merging #1180 (de33346) into master (0cc0c05) will increase coverage by 0.13%.
The diff coverage is 90.71%.


@@            Coverage Diff             @@
##           master    #1180      +/-   ##
==========================================
+ Coverage   82.67%   82.81%   +0.13%     
==========================================
  Files         175      179       +4     
  Lines       51561    52164     +603     
==========================================
+ Hits        42630    43199     +569     
- Misses       8931     8965      +34     
Impacted Files Coverage Δ
parquet/src/arrow/array_reader.rs 78.19% <77.77%> (+0.74%) ⬆️
...et/src/arrow/array_reader/byte_array_dictionary.rs 83.33% <83.33%> (ø)
parquet/src/arrow/array_reader/byte_array.rs 85.71% <85.71%> (-1.35%) ⬇️
parquet/src/arrow/array_reader/offset_buffer.rs 93.10% <91.66%> (ø)
...arquet/src/arrow/array_reader/dictionary_buffer.rs 92.63% <92.63%> (ø)
parquet/src/arrow/array_reader/test_util.rs 100.00% <100.00%> (ø)
parquet/src/arrow/arrow_reader.rs 93.16% <100.00%> (+1.29%) ⬆️
parquet/src/arrow/bit_util.rs 100.00% <100.00%> (ø)
parquet/src/arrow/record_reader.rs 94.07% <100.00%> (+0.04%) ⬆️
parquet/src/arrow/record_reader/buffer.rs 92.42% <100.00%> (+3.19%) ⬆️
... and 15 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

yordan-pavlov (Contributor):

@tustvold do you think that benchmarks should be amended to include reading of dictionary arrays?

tustvold (author):

Added some benchmarks

arrow_array_reader/read StringDictionary, dictionary encoded, mandatory, no NULLs - old                                                                             
                        time:   [1.2224 ms 1.2231 ms 1.2238 ms]
arrow_array_reader/read StringDictionary, dictionary encoded, mandatory, no NULLs - new                                                                             
                        time:   [20.440 us 20.464 us 20.490 us]
arrow_array_reader/read StringDictionary, dictionary encoded, optional, no NULLs - old                                                                             
                        time:   [1.2893 ms 1.2898 ms 1.2902 ms]
arrow_array_reader/read StringDictionary, dictionary encoded, optional, no NULLs - new                                                                             
                        time:   [30.914 us 30.923 us 30.933 us]
arrow_array_reader/read StringDictionary, dictionary encoded, optional, half NULLs - old                                                                             
                        time:   [1.1340 ms 1.1344 ms 1.1347 ms]
arrow_array_reader/read StringDictionary, dictionary encoded, optional, half NULLs - new                                                                             
                        time:   [38.225 us 38.234 us 38.243 us]

So a nice 60x performance improvement 😄

@@ -341,7 +341,9 @@ where
            Some(keys) => {
                // Happy path - can just copy keys
                // Keys will be validated on conversion to arrow
-               decoder.get_batch(keys.spare_capacity_mut(len))
+               let keys_slice = keys.spare_capacity_mut(range.start + len);
tustvold (author):

I need to figure out why this didn't get caught by the fuzz tests

tustvold (author):

Fix in bf28e16

tustvold marked this pull request as draft January 21, 2022 15:13
tustvold (author):

Marking as a draft whilst I work on some further tweaks

tustvold marked this pull request as ready for review January 21, 2022 16:22
tustvold (author):

I'm happy with the state of this PR now. There is some further refinement I'd like to do on the traits, but I would rather do that as a separate PR as this one is getting quite large

alamb (Contributor) commented Jan 21, 2022

I have one more test I want to run locally before approving this:

  1. dump out the parquet data for some local files using this branch and another branch to make sure the same data results

Working on this now

alamb changed the title to "Preserve dictionary encoding when decoding parquet into Arrow arrays, 60x perf improvement (#171)" on Jan 21, 2022
alamb (Contributor) left a comment:

I ran this branch against a bunch of production data we have here at Influx, using the https://github.com/alamb/pdump command built against this branch @ 8985aee and master at 0377aae

-parquet = { git = "https://github.com/apache/arrow-rs.git", rev="0377aaed5ff46214359d1b8d66c27f3afd9323c3" }
+parquet = { git = "https://github.com/tustvold/arrow-rs.git", rev="8985aeeb2ed1f55fbb061f8771ccd390df804d42" }

I dumped each file using pdump built against the two branches and compared the results using diff -- which detected no differences in the output 🎉

Epic work @tustvold

alamb (Contributor) commented Jan 21, 2022

I plan to wait for the weekend to see if anyone else wants a chance to review this before I merge it in

yordan-pavlov (Contributor):

> So a nice 60x performance improvement

@tustvold those benchmark results look very promising indeed; the next logical question is what's the impact on end-to-end performance of queries in DataFusion

tustvold (author) commented Jan 21, 2022

> the next logical question is what's the impact on end-to-end performance of queries in DataFusion

This is highly context dependent: queries that were previously bottlenecked by parquet will of course see improvements. A simple table scan with no predicates, for example, should see most of the raw 60x performance uplift. This sort of "query" shows up in IOx when compacting or loading data from object storage into an in-memory cache.

The story for more complex queries is a bit more WIP. Currently DataFusion's handling of dictionary-encoded arrays isn't brilliant: it often fully materializes dictionaries when it shouldn't need to. apache/datafusion#1475 and apache/datafusion#1610 track the process of switching DataFusion to delegate comparator selection to arrow-rs, which should help to alleviate this.

TLDR: at this point in time I'm focused on getting arrow-rs to a good place, with the necessary base primitives, and then I'll turn my attention to what DataFusion is doing with them. Who knows, someone else may even get there first 😁

houqp (Member) left a comment:

Nice optimization 👍

Labels: parquet (Changes to the parquet crate), performance
Successfully merging this pull request may close these issues: Implement returning dictionary arrays from parquet reader (#171)