Use Vec in ColumnReader (#5177) #5193

tustvold · 2023-12-08T18:18:09Z

Which issue does this PR close?

Part of #5177
Closes #5150

Rationale for this change

See the ticket. This pushes Vec into ColumnReader, avoiding a whole host of issues related to buffer capacity.

What changes are included in this PR?

Are there any user-facing changes?

Most of the changes are to crate-private interfaces, but this is a breaking change to ColumnReader, which is public.

tustvold · 2023-12-08T18:18:54Z

parquet/src/column/reader.rs

@@ -1016,34 +991,22 @@ mod tests {

    #[test]
    fn test_read_batch_values_only() {
-        test_read_batch_int32(16, &mut [0; 10], None, None); // < batch_size


We no longer have this complexity, as the buffers are sized as needed

tustvold · 2023-12-08T18:19:44Z

parquet/src/column/reader.rs

-        let mut max_levels = values.capacity().min(max_records);
-        if let Some(ref levels) = def_levels {
-            max_levels = max_levels.min(levels.capacity());
-        }
-        if let Some(ref levels) = rep_levels {
-            max_levels = max_levels.min(levels.capacity())
-        }


This is the fix for #5150: we no longer truncate the output based on the capacity of the output buffers.
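A minimal sketch of the difference, using hypothetical stand-in functions (`read_into_slice` and `read_into_vec` are illustrative names, not the parquet API). The slice version silently caps the read at the slice length, while the Vec version grows the output as needed:

```rust
/// Hypothetical stand-in for a slice-based read: it writes into a
/// caller-provided slice and stops early when the slice is full.
fn read_into_slice(source: &[i32], max_records: usize, out: &mut [i32]) -> usize {
    // The output length limits how much is read, so records can be truncated.
    let n = source.len().min(max_records).min(out.len());
    out[..n].copy_from_slice(&source[..n]);
    n
}

/// Hypothetical stand-in for a Vec-based read: it appends to the Vec,
/// so only `max_records` and the available data limit the read.
fn read_into_vec(source: &[i32], max_records: usize, out: &mut Vec<i32>) -> usize {
    let n = source.len().min(max_records);
    out.extend_from_slice(&source[..n]);
    n
}

/// Requests 16 records from a 16-record source with both variants.
fn demo() {
    let source: Vec<i32> = (1..=16).collect();

    // A 10-element slice yields only 10 of the 16 requested records.
    let mut small = [0; 10];
    assert_eq!(read_into_slice(&source, 16, &mut small), 10);

    // The Vec is sized as needed and receives all 16 records.
    let mut out = Vec::new();
    assert_eq!(read_into_vec(&source, 16, &mut out), 16);
    assert_eq!(out, source);
}
```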

tustvold · 2023-12-08T18:21:29Z

parquet/src/arrow/array_reader/fixed_len_byte_array.rs

-    fn truncate_buffer(&mut self, len: usize) {
-        assert_eq!(self.buffer.len(), len * self.byte_length);
-    }
+    byte_length: Option<usize>,


This change is necessary so that we can use mem::take

tustvold · 2023-12-08T18:22:47Z

parquet/src/arrow/buffer/dictionary_buffer.rs

@@ -320,7 +287,7 @@ mod tests {
        );

        // Can recreate with new dictionary as keys empty
-        assert!(matches!(&buffer, DictionaryBuffer::Dict { .. }));
+        assert!(matches!(&buffer, DictionaryBuffer::Values { .. }));


This is the result of using std::mem::take, which leaves the default variant behind. It doesn't impact the higher-level dictionary preservation behaviour.
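For illustration, here is a simplified sketch of why the variant changes. The `Buffer` enum below is a hypothetical stand-in, and it assumes the `Default` implementation yields the `Values` variant, as the updated assertion suggests:

```rust
use std::mem;

/// Hypothetical, simplified stand-in for a two-state buffer enum.
#[derive(Debug)]
enum Buffer {
    Dict { keys: Vec<u32> },
    Values { values: Vec<String> },
}

impl Default for Buffer {
    // Assumption: the default state is an empty `Values` buffer.
    fn default() -> Self {
        Buffer::Values { values: Vec::new() }
    }
}

/// Moves the current contents out, leaving `Buffer::default()` in place.
fn take_buffer(buffer: &mut Buffer) -> Buffer {
    mem::take(buffer)
}
```

After `take_buffer`, the returned value still holds the `Dict` state, while the buffer left behind is an empty `Values`, which can later be replaced by a new dictionary.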

tustvold · 2023-12-08T18:23:49Z

parquet/src/arrow/record_reader/buffer.rs

@@ -111,7 +50,7 @@ impl<T: Copy + Default> ValuesBuffer for Vec<T> {
        levels_read: usize,
        valid_mask: &[u8],
    ) {
-        assert!(self.len() >= read_offset + levels_read);
+        self.resize(read_offset + levels_read, T::default());


This change is necessary because we no longer have the get_output_slice / set_len behaviour, which over-allocated at the start.
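A rough sketch of null padding on a plain `Vec`, written as a free function rather than the trait method. The `set_bits_rev` helper and the LSB-first mask layout are assumptions made for this example:

```rust
/// Hypothetical helper: positions of set bits in an LSB-first bitmask,
/// from the highest position below `len` down to the lowest.
fn set_bits_rev(mask: &[u8], len: usize) -> impl Iterator<Item = usize> + '_ {
    (0..len).rev().filter(move |&i| (mask[i / 8] >> (i % 8)) & 1 == 1)
}

/// Spreads densely packed values out to their level positions.
/// Slots that correspond to nulls are left with arbitrary contents.
fn pad_nulls<T: Copy + Default>(
    buf: &mut Vec<T>,
    read_offset: usize,
    values_read: usize,
    levels_read: usize,
    valid_mask: &[u8],
) {
    // Grow the buffer to cover every level; nothing was over-allocated earlier.
    let new_len = read_offset + levels_read;
    buf.resize(new_len, T::default());

    // Walk backwards so a value is never overwritten before it is moved.
    let values_range = read_offset..read_offset + values_read;
    for (value_pos, level_pos) in values_range.rev().zip(set_bits_rev(valid_mask, new_len)) {
        if level_pos <= value_pos {
            break; // The remaining values are already in place.
        }
        buf[level_pos] = buf[value_pos];
    }
}
```

With three packed values and a mask marking positions 0, 2 and 4 as valid, the values end up at indices 0, 2 and 4 of a five-element buffer.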

tustvold · 2023-12-08T18:29:55Z

parquet/src/column/reader/decoder.rs

-        max_records: usize,
+        out: &mut Self::Buffer,
+        num_records: usize,
+        num_levels: usize,


We still have to provide num_levels as parquet's HYBRID_RLE encoding doesn't know how many values it contains 🤦
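One possible illustration, assuming the run layout described in the Parquet format specification: bit-packed runs are written in groups of 8 values and the run header records the group count, so a run can carry trailing padding that the decoder cannot tell apart from real levels. The helper name below is hypothetical:

```rust
/// Value slots occupied by a bit-packed run holding `num_values` values,
/// assuming runs are padded to whole groups of 8 values.
fn bit_packed_slots(num_values: usize) -> usize {
    let groups = (num_values + 7) / 8; // round up to whole groups
    groups * 8
}
```

For 513 levels this gives 520 slots, so the caller has to say how many levels to read.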

tustvold · 2023-12-08T18:30:52Z

parquet/src/arrow/record_reader/buffer.rs

@@ -17,69 +17,8 @@



This module is crate-private, and so none of these changes are user-visible

tustvold · 2023-12-08T18:31:36Z

parquet/src/column/reader/decoder.rs

@@ -423,7 +405,7 @@ impl RepetitionLevelDecoderImpl {
 }

 impl ColumnLevelDecoder for RepetitionLevelDecoderImpl {
-    type Slice = [i16];
+    type Buffer = Vec<i16>;


These traits are not user-visible; however, this change propagates out of ColumnReader and is therefore user-visible.

tustvold · 2023-12-08T18:32:20Z

I intend to run the benchmarks prior to merge, as there are some non-trivial changes to capacity allocation that may have performance impacts.

tustvold · 2023-12-08T18:33:37Z

parquet/src/file/serialized_reader.rs

@@ -1750,7 +1750,7 @@ mod tests {
                assert_eq!(num_levels, 513);

                let expected: Vec<i64> = (1..514).collect();
-                assert_eq!(&buffer[..513], &expected);


I think this is a major usability enhancement: you no longer need to carefully keep track of how much of the values buffer actually contains data.

tustvold · 2023-12-08T18:34:16Z

parquet/src/arrow/record_reader/definition_levels.rs

-        usize::MAX
-    }
-
-    fn count_nulls(&self, range: Range<usize>, _max_level: i16) -> usize {


This is rolled into read_def_levels

alamb

Thank you for this @tustvold -- I started reviewing it and I will try and find the time to complete the review in the next day or two.

Can you please run some performance benchmarks to make sure there is no inadvertent regression introduced?

I also suggest we give it a day or two to let other people comment if they would like a chance to review it

cc @sunchao @zeevm @Dandandan

I clicked the wrong button; I haven't completed my review.

Dandandan

Nice cleanup, as long as the benchmarks are good this LGTM 🥳

viirya · 2023-12-08T20:53:46Z

parquet/src/arrow/array_reader/byte_array.rs

@@ -227,13 +226,13 @@ impl<I: OffsetSizeTrait> ColumnValueDecoder for ByteArrayColumnValueDecoder<I> {
        Ok(())
    }

-    fn read(&mut self, out: &mut Self::Slice, range: Range<usize>) -> Result<usize> {
+    fn read(&mut self, out: &mut Self::Buffer, num_values: usize) -> Result<usize> {


So previously it seems you could pick where to start the read (i.e., range.start), and now you must use skip_values to skip values?

Hmm, actually I don't see how range.start was used to skip values either. Maybe there was just no skipping?

It's a weird quirk of how this API allowed using slices that didn't track their position; this is no longer necessary.

tustvold · 2023-12-11T17:39:29Z

This currently represents a non-trivial regression for primitive columns with large numbers of nulls; I am investigating this now.

Edit: I cannot reproduce this regression on my local machine, which somewhat complicates fixing it.

tustvold · 2023-12-15T18:25:30Z

Well, having bashed my head against these phantom regressions for an embarrassingly long time, it transpires the benchmarks are just exceptionally noisy 🤦 As written, this PR does not represent a consistent performance regression as far as I can ascertain.

Use Vec in ColumnReader (apache#5177)

51bbdbf

github-actions bot added the parquet Changes to the parquet crate label Dec 8, 2023

tustvold added the api-change Changes to the arrow API label Dec 8, 2023

Update parquet_derive

8f23a32

github-actions bot added the parquet-derive label Dec 8, 2023

alamb mentioned this pull request Dec 8, 2023

DataFusion weekly project plan (Andrew Lamb) - Dec 4, 2023 apache/datafusion#8420

Closed

7 tasks

alamb previously approved these changes Dec 8, 2023

Dandandan approved these changes Dec 8, 2023

viirya reviewed Dec 8, 2023

tustvold force-pushed the column-reader-vec branch from 4af59a5 to 8f23a32 Compare December 11, 2023 17:38

tustvold force-pushed the column-reader-vec branch from 7399917 to 8f23a32 Compare December 13, 2023 13:57

tustvold merged commit 9a1e8b5 into apache:master Dec 15, 2023
48 checks passed

This was referenced Jan 5, 2024

GenericColumnReader::read_records Yields Truncated Records #5150

Closed

Use Vec instead of Slice in ColumnReader #5177

Closed
