
Truncate IPC record batch #2040

Merged
merged 11 commits into from Jul 14, 2022

Conversation

viirya
Member

@viirya viirya commented Jul 11, 2022

Which issue does this PR close?

Closes #1528.
Closes #208

Rationale for this change

When serializing a RecordBatch through IPC, we should truncate it so that, if the batch is sliced, only the necessary part is serialized.

The C++ implementation also truncates it:
https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L351
https://github.com/apache/arrow/blob/master/cpp/src/arrow/ipc/writer.cc#L327

What changes are included in this PR?

Truncate the arrays of a record batch when serializing it through IPC. Following C++, we truncate arrays of numeric types, temporal types, the fixed-size binary type, and the base binary types (binary, string).
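
As a rough, hypothetical sketch of the idea (not the actual code in this PR): for a fixed-width values buffer, truncation just means keeping the byte range that the slice's offset and length actually cover.

use arrow::array::ArrayData;
use arrow::buffer::Buffer;

// Hypothetical sketch (not the PR's code): keep only the bytes of a
// fixed-width values buffer that a sliced array actually references.
fn truncated_values(data: &ArrayData, byte_width: usize) -> Buffer {
    let start = data.offset() * byte_width;
    let end = start + data.len() * byte_width;
    // Copy just the needed range into a fresh buffer before writing it out.
    Buffer::from(&data.buffers()[0].as_slice()[start..end])
}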

Are there any user-facing changes?

@github-actions bot added the arrow (Changes to the arrow crate) label Jul 11, 2022
@codecov-commenter

codecov-commenter commented Jul 11, 2022

Codecov Report

Merging #2040 (50c4eb7) into master (330505c) will increase coverage by 0.05%.
The diff coverage is 93.50%.

@@            Coverage Diff             @@
##           master    #2040      +/-   ##
==========================================
+ Coverage   83.55%   83.60%   +0.05%     
==========================================
  Files         222      223       +1     
  Lines       58230    58539     +309     
==========================================
+ Hits        48656    48944     +288     
- Misses       9574     9595      +21     
Impacted Files Coverage Δ
arrow/src/array/mod.rs 100.00% <ø> (ø)
arrow/src/datatypes/datatype.rs 64.05% <30.00%> (-1.63%) ⬇️
arrow/src/ipc/writer.rs 85.05% <97.91%> (+3.26%) ⬆️
parquet/src/file/serialized_reader.rs 93.58% <0.00%> (-1.76%) ⬇️
arrow/src/array/builder/generic_list_builder.rs 95.09% <0.00%> (-1.61%) ⬇️
...row/src/array/builder/string_dictionary_builder.rs 90.64% <0.00%> (-0.72%) ⬇️
parquet/src/file/metadata.rs 94.70% <0.00%> (-0.69%) ⬇️
parquet_derive/src/parquet_field.rs 65.75% <0.00%> (-0.46%) ⬇️
parquet/src/encodings/encoding.rs 93.43% <0.00%> (-0.20%) ⬇️
... and 12 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 330505c...50c4eb7.

Contributor

@HaoYang670 HaoYang670 left a comment

I am not very familiar with IPC, so I'll just give some small nits I found.

Contributor

@tustvold tustvold left a comment

Looking good. I think there is an issue with the way bitmaps are currently handled, but this should just be a case of using Buffer::bit_slice or similar.
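
For reference, Buffer::bit_slice returns a new buffer with the bits rebased so the requested range starts at bit zero, which is what a truncated validity bitmap needs. A hypothetical sketch (the helper name is mine) of how it could be applied to a sliced array's null buffer:

use arrow::array::ArrayData;
use arrow::buffer::Buffer;

// Hypothetical sketch: rebase a sliced array's validity bitmap so that the
// first referenced row becomes bit 0 of a fresh buffer.
fn truncated_null_buffer(data: &ArrayData) -> Option<Buffer> {
    data.null_buffer()
        .map(|nulls| nulls.bit_slice(data.offset(), data.len()))
}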

) {
// Rebase offsets and truncate values
let new_offsets = get_zero_based_value_offsets(array_data);
offset = write_buffer(new_offsets.as_slice(), buffers, arrow_data, offset);
Contributor

Definitely something that could be left to another PR, but I do wonder if there might be some way to write the new offsets directly without buffering them first. I'm not very familiar with the IPC code, so I'm not sure how feasible this may be.
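
For context, "rebasing" here means subtracting the first retained offset from each of the len + 1 retained offsets. A hypothetical, arrow-independent sketch of that arithmetic for i32 offsets (not the PR's get_zero_based_value_offsets):

// Hypothetical sketch: given the raw value offsets of a string/binary array and
// the slice (offset, len), produce zero-based offsets plus the byte range of the
// values buffer that still needs to be written.
fn rebase_offsets(raw: &[i32], offset: usize, len: usize) -> (Vec<i32>, std::ops::Range<usize>) {
    let window = &raw[offset..=offset + len]; // len + 1 retained offsets
    let first = window[0];
    let rebased = window.iter().map(|o| o - first).collect();
    let values = first as usize..window[len] as usize;
    (rebased, values)
}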

@alamb
Contributor

alamb commented Jul 12, 2022

❤️ -- thank you @viirya

Note we have some code in IOx to work around the lack of this feature (described in https://github.com/influxdata/influxdb_iox/issues/1133):
https://github.com/influxdata/influxdb_iox/blob/3592aa52d815e861ac14031186ba557c9290dfaf/arrow_util/src/optimize.rs#L97-L141

I wonder if this PR also closes #208?

@viirya
Member Author

viirya commented Jul 12, 2022

Yes, flight data also uses IPC to encode record batches, so this closes #208 as well.

@alamb
Contributor

alamb commented Jul 12, 2022

Yes, flight data also uses IPC to encode record batches, so this closes #208 as well.

Updated the PR description to match

@tustvold
Contributor

tustvold commented Jul 14, 2022

Apologies, I'm trying to review this, but I'm extremely confused as to how this has been and is working. I can't find where the IPC writer makes use of the ArrayData offset, and am therefore at a loss as to how slicing has historically been preserved. Could you possibly point me at the relevant code? As far as I can tell it previously wrote the entirety of the buffers, but I can't work out how the reader then knows to ignore most of this.

Edit: It would appear that the previous logic was and still is simply broken...

);

Contributor

@tustvold tustvold Jul 14, 2022

Suggested change
    assert_eq!(
        deserialize(serialize_without_truncate(&record_batch_slice)),
        record_batch_slice
    );

Or something along those lines; basically I think the non-sliced logic was and still is wrong.

@viirya
Member Author

viirya commented Jul 14, 2022

I was looking for where the reader knows to ignore it, but I didn't find it. Then I realized that previously the writer wrote the entire buffers, while the reader rebases the offsets from zero and so reads the buffers from incorrect positions.

So I think it is a bug: a sliced record batch cannot be read back correctly after IPC. We can reproduce it:

use std::sync::Arc;

use arrow::array::{Int32Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn create_batch() -> RecordBatch {
    let schema = Schema::new(vec![
        Field::new("a", DataType::Int32, true),
        Field::new("b", DataType::Utf8, true),
    ]);

    let a = Int32Array::from(vec![1, 2, 3, 4, 5]);
    let b = StringArray::from(vec!["a", "b", "c", "d", "e"]);

    RecordBatch::try_new(Arc::new(schema), vec![Arc::new(a), Arc::new(b)]).unwrap()
}

let big_record_batch = create_batch();

let offset = 2;
let record_batch_slice = big_record_batch.slice(offset, 3);
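
The serialize/deserialize step is not shown above; a minimal sketch of how such a round trip could look with the stream IPC writer and reader (my own helper, not necessarily the exact code used to produce the output below):

use std::io::Cursor;

use arrow::ipc::reader::StreamReader;
use arrow::ipc::writer::StreamWriter;
use arrow::record_batch::RecordBatch;

// Hypothetical helper: serialize a batch to stream IPC bytes and read it back.
fn round_trip(batch: &RecordBatch) -> RecordBatch {
    let mut bytes = Vec::new();
    {
        let mut writer = StreamWriter::try_new(&mut bytes, &batch.schema()).unwrap();
        writer.write(batch).unwrap();
        writer.finish().unwrap();
    }
    let mut reader = StreamReader::try_new(Cursor::new(bytes), None).unwrap();
    reader.next().unwrap().unwrap()
}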

Original record batches:

RecordBatch { schema: Schema { fields: [Field { name: "a", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "b", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }], metadata: {} }, columns: [PrimitiveArray<Int32>
[
  1,
  2,
  3,
  4,
  5,
], StringArray
[
  "a",
  "b",
  "c",
  "d",
  "e",
]], row_count: 5 }
RecordBatch { schema: Schema { fields: [Field { name: "a", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "b", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }], metadata: {} }, columns: [PrimitiveArray<Int32>
[
  3,
  4,
  5,
], StringArray
[
  "c",
  "d",
  "e",
]], row_count: 3 }

Record batches after IPC serialization/deserialization:

RecordBatch { schema: Schema { fields: [Field { name: "a", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "b", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }], metadata: {} }, columns: [PrimitiveArray<Int32>
[
  1,
  2,
  3,
  4,
  5,
], StringArray
[
  "a",
  "b",
  "c",
  "d",
  "e",
]], row_count: 5 }
RecordBatch { schema: Schema { fields: [Field { name: "a", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }, Field { name: "b", data_type: Utf8, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: None }], metadata: {} }, columns: [PrimitiveArray<Int32>
[
  null,
  null,
  5,
], StringArray
[
  "a",
  "b",
  "c",
]], row_count: 3 }

@viirya
Member Author

viirya commented Jul 14, 2022

That is to say, we need to truncate it unconditionally, without the option. I don't find that the C++ implementation has an option for that, actually.

@tustvold
Contributor

Perhaps we should make it non-optional then. I had thought the offset was separately preserved; given it isn't, there doesn't seem to be a reason to have the option.

@viirya
Member Author

viirya commented Jul 14, 2022

Unrelated error on Windows CI:

error: error: Invalid value for '<+toolchain>': Toolchain overrides must begin with '+'

@tustvold tustvold merged commit 86543a4 into apache:master Jul 14, 2022
@tustvold
Contributor

I think there is more to be done in this vein, e.g. ListArray, StructArray, etc... but this is a good step forward. Thank you 🙂

@viirya
Member Author

viirya commented Jul 14, 2022

Thanks @alamb @tustvold @JasonLi-cn @HaoYang670

@viirya
Member Author

viirya commented Jul 14, 2022

I don't find that the C++ implementation does truncation for ListArray or StructArray, but maybe I missed it when reading the code. I will re-check this.

@ursabot

ursabot commented Jul 14, 2022

Benchmark runs are scheduled for baseline = 5e520bb and contender = 86543a4. 86543a4 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@alamb
Contributor

alamb commented Jul 15, 2022

I think there is more to be done in this vein, e.g. ListArray, StructArray, etc... but this is a good step forward. Thank you 🙂

@tustvold is the remaining work recorded anywhere? I am happy to write up a ticket but I don't understand the issue well enough to do so at this moment

@tustvold
Contributor

Filed #2080 to track follow-up work. I personally think we should accelerate efforts to remove offset from ArrayData, as opposed to continuing to work around it in various places.
