
ARROW-8823: [C++] Add total size of batch buffers to IPC write statistics #11872

Closed · 15 commits

Conversation

@ahmet-uyar (Contributor) commented Dec 6, 2021

Adds statistics covering both the original (raw) buffer sizes and the padded or compressed serialized sizes.

@github-actions bot commented Dec 6, 2021

Thanks for opening a pull request!

If this is not a minor PR, could you open an issue for this pull request on JIRA? https://issues.apache.org/jira/browse/ARROW

Opening JIRAs ahead of time contributes to the openness of the Apache Arrow project.

Could you then also rename the pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}


@pitrou (Member) commented Dec 6, 2021

Hi @ahmet-uyar , can you ensure the PR title is properly formatted? See the automated comment above for guidelines. Thank you!

@pitrou (Member) left a review

Thanks for the submission @ahmet-uyar ! Here are some comments below. Feel free to ask questions if not everything is clear.


/// compression ratio for the body of all record batches serialized
/// this is equivalent to:
/// serialized_body_length / raw_body_length
@pitrou (Member):

Since this is equivalent, I don't think there is any value in exposing it. People can trivially calculate it themselves.

@ahmet-uyar (Contributor, Author):

Removed the compression ratio from WriteStats.

/// initial and serialized (may be compressed) body lengths for record batches
/// these values show the total sizes of all record batch body lengths
int64_t raw_body_length = 0;
int64_t serialized_body_length = 0;
@pitrou (Member):

In the Arrow C++ APIs, "length" generally points to the logical number of elements (see e.g. Array::length()), while "size" points to the physical size in bytes (as in Buffer::size()). So I think this should be:

int64_t total_raw_body_size = 0;
int64_t total_compressed_body_size = 0;

@ahmet-uyar (Contributor, Author):

Sometimes it is not compressed, only serialized. Should we name it total_serialized_body_size?

@pitrou (Member):

Yes, why not.

ASSERT_OK_AND_ASSIGN(write_options2.codec, util::Codec::Create(Compression::LZ4_FRAME));

// pre-computed compression ratios for record batches with Compression::LZ4_FRAME
std::vector<float> comp_ratios{1.0f, 0.64f, 0.79924363f};
@pitrou (Member):

I don't think we want to hard-code this. The values can vary depending on the version of the lz4 library, or internal details of how we initialize the compressor. Just testing that some compression happens should be sufficient.

@ahmet-uyar (Contributor, Author):

Added raw record batch sizes instead.

// ARROW-8823: Calculating the compression ratio
FileWriterHelper helper;
IpcWriteOptions write_options1 = IpcWriteOptions::Defaults();
IpcWriteOptions write_options2 = IpcWriteOptions::Defaults();
@pitrou (Member):

Can you give these clearer names, e.g. options_uncompressed and options_compressed?

@ahmet-uyar (Contributor, Author):

Done.

@@ -75,6 +76,19 @@ struct WriteStats {
/// Number of replaced dictionaries (i.e. where a dictionary batch replaces
/// an existing dictionary with an unrelated new dictionary).
int64_t num_replaced_dictionaries = 0;

/// initial and serialized (may be compressed) body lengths for record batches
/// these values show the total sizes of all record batch body lengths
@pitrou (Member):

Will this also include the dictionary batches?

@ahmet-uyar (Contributor, Author):

Yes.

batches[2] =
RecordBatch::Make(schema, length, {rg.String(500, 0, 10, 0.1), dict_array});

for (size_t i = 0; i < batches.size(); ++i) {
@pitrou (Member):

As a suggestion for these tests, it would be nice to check that:

  • the raw body size is accurate (it can be hard-coded, since it should be stable and deterministic)
  • the compressed body size is equal to or smaller than the raw body size, depending on the compression parameter

@ahmet-uyar (Contributor, Author):

Done as suggested.
But there is a slight difference: when a record batch is serialized, buffer sizes are padded to a multiple of 8, so when there is no compression the serialized record batch size can be slightly larger. In that case, the raw size is less than or equal to the serialized size.

In addition, when compression is used on very little data (a few hundred bytes), the compressed size can actually be larger than the raw size. I have not put this case into the test, so it is not a problem.

@pitrou (Member):

Ah, you're right, the padding can make the size slightly larger. Can you add a comment explaining this?

@ahmet-uyar (Contributor, Author):

Done.

@ahmet-uyar ahmet-uyar changed the title Arrow-8823 Calculating compression ratio ARROW-8823: [C++] Compute aggregate compression ratio when producing compressed IPC body messages Dec 6, 2021
@github-actions bot commented Dec 6, 2021

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@pitrou (Member) left a review

Looks good to me, just some minor nits now.

@@ -1727,6 +1727,61 @@ TEST(TestIpcFileFormat, FooterMetaData) {
ASSERT_TRUE(out_metadata->Equals(*metadata));
}

TEST_F(TestWriteRecordBatch, CompressionRatio) {
// ARROW-8823: Calculating the compression ratio
@pitrou (Member):

Nit: rename this test and update the comment now that we don't compute a ratio anymore?

@ahmet-uyar (Contributor, Author):

Done.

ASSERT_OK(helper.WriteBatch(batches[i]));
ASSERT_OK(helper.Finish());
ASSERT_GE(helper.writer_->stats().total_raw_body_size,
helper.writer_->stats().total_serialized_body_size);
@pitrou (Member):

Can this be ASSERT_GT instead or will it fail the test?

@ahmet-uyar (Contributor, Author):

Equality is needed because one of the record batches has zero rows, so both total_raw_body_size and total_serialized_body_size are zero.

Two further review threads on cpp/src/arrow/ipc/writer.h were marked outdated and resolved.
ahmet-uyar and others added 2 commits December 7, 2021 18:38
improving documentation

Co-authored-by: Antoine Pitrou <pitrou@free.fr>
@pitrou pitrou changed the title ARROW-8823: [C++] Compute aggregate compression ratio when producing compressed IPC body messages ARROW-8823: [C++] Add total written size of batch buffers to IPC write statistics Dec 7, 2021
@pitrou pitrou changed the title ARROW-8823: [C++] Add total written size of batch buffers to IPC write statistics ARROW-8823: [C++] Add total size of batch buffers to IPC write statistics Dec 7, 2021
@pitrou pitrou closed this in 77722d9 Dec 7, 2021
@ursabot commented Dec 7, 2021

Benchmark runs are scheduled for baseline = 9cf4275 and contender = 77722d9. 77722d9 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Scheduled] ursa-i9-9960x
[Scheduled] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot commented Dec 8, 2021

Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️2.22% ⬆️0.74%] ursa-i9-9960x
[Finished ⬇️0.22% ⬆️0.18%] ursa-thinkcentre-m75q


4 participants